We consider optimal sequential allocation in the context of the
so-called stochastic multi-armed bandit model. We describe a generic
index policy, in the sense of Gittins (1979), based on upper confidence
bounds of the arm payoffs computed using the Kullback-Leibler divergence.
We consider two classes of distributions for which instances
of this general idea are analyzed: the kl-UCB algorithm is designed
for one-parameter exponential families and the empirical KL-UCB
algorithm for bounded and finitely supported distributions. Our main
contribution is a unified finite-time analysis of the regret of these
algorithms that asymptotically matches the lower bounds of Lai and
Robbins (1985) and Burnetas and Katehakis (1996), respectively. We
also investigate the behavior of these algorithms when used with general
bounded rewards, showing in particular that they provide significant
improvements over the state-of-the-art.

1. Introduction. This paper is about optimal sequential allocation in
unknown random environments. More precisely, we consider the setting
known under the conventional, if not very explicit, name of (stochastic)
multi-armed bandit, in reference to the 19th century gambling game. In the
multi-armed bandit model, the emphasis is put on focusing as quickly as
possible on the best available option(s) rather than on estimating precisely
the efficiency of each option. These options are referred to as arms and
each of them is associated with a distribution; arms are indexed by $a$ and
associated distributions are denoted by $\nu_a$.

The archetypal example occurs in clinical trials where the options (or 
arms) correspond to available treatments whose efficiencies are unknown a 
priori and patients arrive sequentially; the action consists of prescribing a 
particular treatment to the patient and the observation corresponds (for 
instance) to the success or failure of the treatment. The goal here is clearly
to achieve as many successes as possible. A strategy for doing so is said to
be anytime if it does not require knowing in advance the number of patients
that will participate in the experiment. Although the term multi-armed
bandit was probably coined in the late 1960's (Gittins, 1979), the origin of 
the problem can be traced back to fundamental questions about optimal 
stopping policies in the context of clinical trials (see Thompson, 1933, 1935) 
raised since the 1930's (see also Wald, 1945; Robbins, 1952). 

In his celebrated work, Gittins (1979) considered the Bayesian-optimal
solution to the discounted infinite-horizon multi-armed bandit problem. Git- 
tins first showed that the Bayesian optimal policy could be determined by 
dynamic programming in an extended Markov decision process. The second 
key element is the fact that the optimal policy search can be factored into 
a set of simpler computations to determine indices that fully characterize 
each arm given the current history of the game (Gittins, 1979; Whittle, 1980; 
Weber, 1992). The optimal policy is then an index policy in the sense that 
at each time round, the (or an) arm with highest index is selected. Hence, 
index policies only differ in the way the indices are computed. 

From a practical perspective however, the use of Gittins indices is limited 
to specific arm distributions and is computationally challenging (Gittins, 
Glazebrook and Weber, 2011). In the 1980's, pioneering works by Lai and
Robbins (1985), Chang and Lai (1987), Burnetas and Katehakis (1996, 1997, 
2003) suggested that Gittins indices can be approximated by quantities that 
can be interpreted as upper bounds of confidence intervals. This insight was 
used in the landmark paper by Auer, Cesa-Bianchi and Fischer (2002) who 
popularized the acronym UCB (for upper confidence bounds) to refer to a 
particular variant of indices obtained using Hoeffding's inequality. 

There are however significant differences between the algorithms and re- 
sults of Gittins (1979) and Auer, Cesa-Bianchi and Fischer (2002). First, 
UCB is an anytime algorithm that does not rely on the use of a discount 
factor or even on the knowledge of the horizon of the problem. More signif- 
icantly, the Bayesian perspective is absent and UCB is analyzed in terms of 
its frequentist (distribution-dependent or distribution-free) performance, by 
exhibiting so-called finite-time, non-asymptotic, regret bounds. 

UCB is a very robust algorithm that is suitable for and has strong per- 
formance guarantees, including distribution-free ones, in all problems with 
bounded stochastic rewards. However, a closer examination of the arguments 
in the proof reveals that the form of the upper confidence bounds used in 
UCB is a direct consequence of the use of Hoeffding's inequality and signifi- 
cantly differs from the approximate form of Gittins indices suggested by Lai 
and Robbins (1985) or Burnetas and Katehakis (1996). Furthermore, the
frequentist asymptotic lower bounds for the regret obtained by these authors
also suggest that the behavior of UCB can be far from optimal. Indeed,
under suitable conditions on the model $\mathcal{D}$ (the class of possible distributions
associated with each arm), any policy that is "admissible" (i.e., not grossly
under-performing, see Lai and Robbins, 1985 for details) must satisfy the
following asymptotic inequality on its "regret" (a quantity to be defined in
Section 2):

\[
\liminf_{T \to \infty} \frac{R_T}{\log(T)} \;\geq\; \sum_{a \,:\, \mu_a < \mu^*} \frac{\mu^* - \mu_a}{\mathcal{K}_{\inf}(\nu_a, \mu^*)} \,, \tag{1}
\]

where $R_T$ is the regret at round $T$ and $\mu_a$ denotes the expectation of the
distribution $\nu_a$ of arm $a$, while $\mu^*$ is the maximal expectation among all
arms. The quantity

\[
\mathcal{K}_{\inf}(\nu, \mu) = \inf\bigl\{ \mathrm{KL}(\nu, \nu') : \nu' \in \mathcal{D} \text{ and } \mathrm{E}(\nu') > \mu \bigr\} ,
\]

which measures the difficulty of the problem, is the minimal Kullback-Leibler
divergence between the arm distribution and distributions in the model $\mathcal{D}$
that have expectations larger than $\mu$. By comparison, the bound obtained
in Auer, Cesa-Bianchi and Fischer (2002) for UCB is of the form

\[
R_T \;\leq\; C \Biggl( \sum_{a \,:\, \mu_a < \mu^*} \frac{1}{\mu^* - \mu_a} \Biggr) \log(T) + o\bigl(\log(T)\bigr)
\]

for some numerical constant $C$, e.g., $C = 8$ (we provide a refinement of
the result of Auer, Cesa-Bianchi and Fischer, 2002 as Corollary 2 below).
These two results coincide as to the logarithmic rate of the regret but the
(distribution-dependent) constants differ, sometimes significantly. Based on
this observation, Honda and Takemura (2010) proposed an algorithm, called
DMED, that is not an index policy but was shown to improve over UCB in some
situations.

Building on similar ideas, we show in this paper that for a large class of
problems there does exist a generic index policy, already suggested by Lai
and Robbins (1985) and Burnetas and Katehakis (2003), that guarantees
a bound on the expected regret of the form

\[
R_T \;\leq\; \Biggl( \sum_{a \,:\, \mu_a < \mu^*} \frac{\mu^* - \mu_a}{\mathcal{K}_{\inf}(\nu_a, \mu^*)} \Biggr) \log(T) + o\bigl(\log(T)\bigr)
\]

and which is thus asymptotically optimal. Interestingly, the index used in
this algorithm can be interpreted as the upper bound of a confidence region
for the expectation constructed using an empirical likelihood principle
(Owen, 2001).
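
To give a rough sense of how far apart the distribution-dependent constants of the lower bound (1) and of the UCB bound can be, the following Python sketch evaluates both for a single suboptimal Bernoulli arm. It assumes that, for Bernoulli arms, $\mathcal{K}_{\inf}$ reduces to the Kullback-Leibler divergence between the two means (as recalled in Section 4.2 for exponential families); the function names and the numerical values are illustrative assumptions, not taken from the paper.

```python
import math


def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lower_bound_constant(mu_star, mu_a):
    """Per-arm factor of log(T) in the lower bound (1), Bernoulli case."""
    return (mu_star - mu_a) / kl_bernoulli(mu_a, mu_star)


def ucb_constant(mu_star, mu_a, C=8.0):
    """Per-arm factor of log(T) in the UCB bound, with the constant C = 8."""
    return C / (mu_star - mu_a)


# Hypothetical two-armed example with small success probabilities.
mu_star, mu_a = 0.10, 0.05
lb = lower_bound_constant(mu_star, mu_a)
ub = ucb_constant(mu_star, mu_a)
print(f"lower-bound constant: {lb:.2f}, UCB constant: {ub:.2f}")
```

With these (assumed) values the first constant is close to 3 while the second is 160, which illustrates why the gap between the two bounds can be significant when the means are close to 0 or 1.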

We describe the implementation of this algorithm and analyze its perfor- 
mance in two practically important cases where the lower bound of (1) was 
shown to hold (Lai and Robbins, 1985; Burnetas and Katehakis, 1996) — 
namely, for one-parameter canonical exponential families of distributions 
(Section 4), in which case the algorithm is referred to as kl-UCB; and for 
finitely supported distributions (Section 5), where the algorithm is called 
empirical KL-UCB. Determining the empirical KL-UCB index requires solving 
a convex program (maximizing a linear function on the probability simplex 
under Kullback-Leibler constraints) for which we provide in Appendix C.1
a simple algorithm inspired by Filippi, Cappé and Garivier (2010).

The analysis presented here greatly improves over the preliminary results
presented, on the one hand in Garivier and Cappé (2011), and on the other
hand in Maillard, Munos and Stoltz (2011); more precisely, the improvements
lie in the greater generality of the analysis and in the more precise
evaluation of the remainder terms in the regret bounds. We believe that
the result obtained in this paper for kl-UCB (Theorem 1) is not improvable.
For empirical KL-UCB the bounding of the remainder term could be
improved upon obtaining a sharper version of the contraction lemma for
$\mathcal{K}_{\inf}$ (Lemma 6). Our numerical simulations indeed suggest that the factor
$(n + 2)$ appearing in its bound may be superfluous. The proofs rely on results
of independent statistical interest: non-asymptotic bounds on the level of
sequential confidence intervals for the expectation of independent, identically
distributed variables in canonical exponential families (Lemma 5) using, in
the bounded case, the empirical likelihood method (Proposition 1).

For general bounded distributions, we further make three important ob- 
servations. First, the particular instance of the kl-UCB algorithm based on 
the Kullback-Leibler divergence between normal distributions is the UCB 
algorithm, which allows us to provide an improved optimal finite-time anal- 
ysis of its performance (Corollary 2). Next, the kl-UCB algorithm, when 
used with the Kullback-Leibler divergence between Bernoulli distributions, 
obtains a strictly better performance than UCB, for any bounded distribution 
(Corollary 1). Finally, although a complete analysis of the empirical KL-UCB 
algorithm is subject to further investigations, we show here that the em- 
pirical KL-UCB index is indeed a valid empirical-likelihood upper-confidence 
bound for the expectation, that is, that it has a guaranteed, non-asymptotic 
bound on the coverage probability, for any bounded distribution (Proposition 1).
We provide some empirical evidence that empirical KL-UCB also
performs well for general bounded distributions and illustrate the tradeoffs 
around the use of the two algorithms, in particular for short horizons. 

Outline. The paper is organized as follows. Section 2 introduces the neces- 
sary notations and defines the notion of regret. Section 3 presents the generic 
form of the KL-UCB algorithm and provides the main steps for its analysis, 
leaving two facts to be proven under each specific instantiation of the al- 
gorithm to a model. The kl-UCB algorithm in the case of one-dimensional 
exponential families is considered in Section 4, and the empirical KL-UCB 
algorithm for bounded and finitely supported distributions is presented in 
Section 5. Finally, the behavior of these algorithms in the case of general 
bounded distributions is investigated in Section 6; and numerical experi- 
ments comparing kl-UCB and empirical KL-UCB to their competitors are 
reported in Section 7. Proofs are postponed to the appendix (Sections A- 
C). 

2. Setup and notation. We consider a bandit problem with finitely
many arms indexed by $a \in \{1, \ldots, K\}$, with $K \geq 2$, each associated with
an (unknown) probability distribution $\nu_a$ over $\mathbb{R}$. We assume however that
a model $\mathcal{D}$ is known: a family of probability distributions such that $\nu_a \in \mathcal{D}$
for all arms $a$.

The game is sequential and goes as follows: At each round $t \geq 1$, the
player picks an arm $A_t$ (based on the information gained in the past) and
receives a stochastic payoff $Y_t$ drawn independently at random according to
the distribution $\nu_{A_t}$. The player only gets to see the payoff $Y_t$.

2.1. Assessment of the quality of a strategy via its expected regret. For
each arm $a \in \{1, \ldots, K\}$, we denote by $\mu_a$ the expectation of its associated
distribution $\nu_a$ and we let $a^*$ be any optimal arm, i.e.,

\[
a^* \in \operatorname*{argmax}_{a \in \{1, \ldots, K\}} \mu_a \,.
\]

We write $\mu^*$ as a short-hand notation for the largest expectation $\mu_{a^*}$ and
denote the gap of the expected payoff $\mu_a$ of an arm $a$ to $\mu^*$ as $\Delta_a = \mu^* - \mu_a$.
In addition, the number of times each arm $a$ is pulled between the rounds 1
and $T$ is referred to as $N_a(T)$,

\[
N_a(T) \stackrel{\mathrm{def}}{=} \sum_{t=1}^{T} \mathbb{I}_{\{A_t = a\}} \,.
\]
t=l 
The quality of a strategy will be evaluated through the standard notion
of expected regret, which we define formally now. The expected regret (or
simply, regret) at round $T \geq 1$ is defined as

\[
R_T = \mathbb{E}\Biggl[ \sum_{t=1}^{T} \bigl( \mu^* - Y_t \bigr) \Biggr]
    = \mathbb{E}\Biggl[ \sum_{t=1}^{T} \bigl( \mu^* - \mu_{A_t} \bigr) \Biggr]
    = \sum_{a=1}^{K} \Delta_a \, \mathbb{E}\bigl[ N_a(T) \bigr] \,, \tag{3}
\]

where we used the tower rule for the first equality. Note that the expectation
is with respect to the random draws of the $Y_t$ according to the $\nu_{A_t}$ and also
to the possible auxiliary randomizations that the decision-making strategy
may resort to.

The regret measures the cumulative loss resulting from pulling suboptimal 
arms, and thus quantifies the amount of exploration required by an algorithm 
in order to find a best arm, since, as (3) indicates, the regret scales with the 
expected number of pulls of suboptimal arms. 
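
As a small illustration of the last expression in (3), the following Python sketch computes the regret from the arm expectations and the expected numbers of pulls. The function name and the numbers in the example are hypothetical and only serve to show how suboptimal pulls translate into regret.

```python
def regret_from_pulls(means, expected_pulls):
    """Right-hand side of (3): sum over arms of Delta_a * E[N_a(T)].

    `means` holds the expectations mu_a, `expected_pulls` the values E[N_a(T)].
    """
    mu_star = max(means)
    return sum((mu_star - mu) * n for mu, n in zip(means, expected_pulls))


# Hypothetical three-armed problem after T = 1000 rounds.
means = [0.5, 0.4, 0.2]
expected_pulls = [900.0, 70.0, 30.0]
print(regret_from_pulls(means, expected_pulls))  # about 0.1 * 70 + 0.3 * 30 = 16
```

Pulls of the optimal arm contribute nothing, so only the counts of the suboptimal arms matter.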



2.2. Empirical distributions. We will denote them in two related ways,
depending on whether random averages indexed by the global time $t$ or
averages of a given number $n$ of pulls of a given arm are considered. The first
series of averages will be referred to by using a functional notation for the
indexation in the global time: $\hat\nu_a(t)$, while the second series will be indexed
with the local times $n$ in subscripts: $\hat\nu_{a,n}$. These two related indexations,
functional for global times and random averages versus subscript indexes
for local times, will be consistent throughout the paper for all quantities at
hand, not only empirical averages.

More formally, for all arms $a$ and all rounds $t$ such that $N_a(t) \geq 1$,

\[
\hat\nu_a(t) = \frac{1}{N_a(t)} \sum_{s=1}^{t} \delta_{Y_s} \, \mathbb{I}_{\{A_s = a\}} \,,
\]

where $\delta_x$ denotes the Dirac distribution on $x \in \mathbb{R}$.

For averages based on local times we need to introduce stopping times. To
that end, we consider the filtration $(\mathcal{F}_t)$, where for all $t \geq 1$, the $\sigma$-algebra
$\mathcal{F}_t$ is generated by $A_1, Y_1, \ldots, A_t, Y_t$. In particular, $A_{t+1}$ and all $N_a(t+1)$
are $\mathcal{F}_t$-measurable. For all $n \geq 1$, we denote by $\tau_{a,n}$ the round at which $a$
was pulled for the $n$-th time; since

\[
\tau_{a,n} = \min\bigl\{ t \geq 1 : N_a(t) = n \bigr\} \,,
\]

we see that $\{\tau_{a,n} = t\}$ is $\mathcal{F}_{t-1}$-measurable. That is, each random variable
$\tau_{a,n}$ is a (predictable) stopping time. Hence, as shown for instance in (Chow
and Teicher, 1988, Section 5.3), the random variables $X_{a,n} = Y_{\tau_{a,n}}$, where
$n = 1, 2, \ldots$, are independent and identically distributed according to $\nu_a$.
For all arms $a$, we then denote by

\[
\hat\nu_{a,n} = \frac{1}{n} \sum_{k=1}^{n} \delta_{X_{a,k}}
\]

the empirical distributions corresponding to local times $n \geq 1$.
All in all, we of course have the rewriting

\[
\hat\nu_a(t) = \hat\nu_{a, N_a(t)} \,.
\]

3. The KL-UCB algorithm. We fix an interval or discrete subset $S \subseteq \mathbb{R}$
and denote by $\mathfrak{M}_1(S)$ the set of all probability distributions over $S$. For two
distributions $\nu, \nu' \in \mathfrak{M}_1(S)$, we denote by $\mathrm{KL}(\nu, \nu')$ their Kullback-Leibler
divergence and by $\mathrm{E}(\nu)$ and $\mathrm{E}(\nu')$ their expectations. (This expectation
operator is denoted by $\mathrm{E}$ while expectations with respect to underlying
randomizations are referred to as $\mathbb{E}$.)

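Algorithm 1: The KL-UCB algorithm (generic form).

Parameters: An operator $\Pi_{\mathcal{D}} : \mathfrak{M}_1(S) \to \mathcal{D}$; a non-decreasing function $f : \mathbb{N} \to \mathbb{R}$
Initialization: Pull each arm of $\{1, \ldots, K\}$ once

for $t = K$ to $T - 1$, do

compute for each arm $a$ the quantity

\[
U_a(t) = \sup\Bigl\{ \mathrm{E}(\nu) : \nu \in \mathcal{D} \text{ and } \mathrm{KL}\bigl( \Pi_{\mathcal{D}}(\hat\nu_a(t)), \nu \bigr) \leq \frac{f(t)}{N_a(t)} \Bigr\} \tag{4}
\]

pick an arm $A_{t+1} \in \operatorname*{argmax}_{a \in \{1, \ldots, K\}} U_a(t)$
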

The generic form of the algorithm of interest in this paper is described
as Algorithm 1. It relies on two parameters: an operator $\Pi_{\mathcal{D}}$ (in spirit, a
projection operator) that associates with each empirical distribution $\hat\nu_a(t)$ an
element of the model $\mathcal{D}$; and a non-decreasing function $f$, which is typically
such that $f(t) \approx \log(t)$.

At each round $t \geq K$, an upper confidence bound $U_a(t)$ is associated with
the expectation $\mu_a$ of the distribution $\nu_a$ of each arm; an arm $A_{t+1}$ with
highest upper confidence bound is then played. Note that the algorithm
does not need to know the time horizon $T$ in advance. Furthermore, the
UCB algorithm of Auer, Cesa-Bianchi and Fischer (2002) may be recovered
by replacing $\mathrm{KL}\bigl(\Pi_{\mathcal{D}}(\hat\nu_a(t)), \nu\bigr)$ with a quantity proportional to
$\bigl(\mathrm{E}(\hat\nu_a(t)) - \mathrm{E}(\nu)\bigr)^2$; the implications of this observation will be made more explicit in
Section 6.
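
The structure of Algorithm 1 (initialization, index computation, arm selection) can be sketched in Python as follows. This is only an illustration under stated assumptions: the index is the closed-form one obtained when the divergence is replaced by the quadratic quantity $2(\hat\mu - \mu)^2$ (the factor 2 is our choice, corresponding to a Gaussian divergence with variance 1/4), $f = \log$, and the helper names `run_index_policy`, `quadratic_index` and `bernoulli_arm` are hypothetical.

```python
import math
import random


def quadratic_index(mean, n_pulls, t, f=math.log):
    """sup{mu : 2 * (mean - mu)**2 <= f(t) / n_pulls}, available in closed form."""
    return mean + math.sqrt(f(t) / (2.0 * n_pulls))


def run_index_policy(arms, horizon, index=quadratic_index, seed=0):
    """Run a generic index policy; `arms` is a list of functions rng -> reward."""
    rng = random.Random(seed)
    K = len(arms)
    counts = [0] * K      # N_a(t)
    sums = [0.0] * K      # cumulated rewards of each arm
    # Initialization: pull each arm once (rounds 1, ..., K).
    for a in range(K):
        sums[a] += arms[a](rng)
        counts[a] += 1
    # Rounds t = K, ..., T - 1: compute the indices U_a(t), then play A_{t+1}.
    for t in range(K, horizon):
        indices = [index(sums[a] / counts[a], counts[a], t) for a in range(K)]
        chosen = max(range(K), key=lambda a: indices[a])
        sums[chosen] += arms[chosen](rng)
        counts[chosen] += 1
    return counts


def bernoulli_arm(p):
    """Arm returning 1 with probability p and 0 otherwise."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


counts = run_index_policy([bernoulli_arm(0.2), bernoulli_arm(0.8)], horizon=2000)
print(counts)  # the second (better) arm should receive most of the pulls
```

Other instances of the algorithm only change the `index` function; the loop itself never uses the horizon except as a stopping point, which reflects the anytime nature of the policy.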

3.1. General analysis of performance. In Sections 4 and 5, we prove
non-asymptotic regret bounds for Algorithm 1 in two different settings. These
bounds match the asymptotic lower bound (1) in the sense that, according
to (3), bounding the expected regret is equivalent to bounding the number
of suboptimal draws. We show that, for any suboptimal arm $a$, we have

\[
\mathbb{E}\bigl[ N_a(T) \bigr] \leq \frac{\log(T)}{\mathcal{K}_{\inf}(\nu_a, \mu^*)} \bigl( 1 + o(1) \bigr) \,,
\]

where the quantity $\mathcal{K}_{\inf}(\nu_a, \mu^*)$ was defined in the introduction. This result
appears as a consequence of non-asymptotic bounds, which are derived using
a common analysis framework detailed in the rest of this section.

Note that the term $\log(T)/\mathcal{K}_{\inf}(\nu_a, \mu^*)$ has a heuristic interpretation in
terms of large deviations, which gives some insight into the regret analysis to
be presented below. Let $\nu' \in \mathcal{D}$ be such that $\mathrm{E}(\nu') \geq \mu^*$, let $X'_1, \ldots, X'_n$ be
independent variables with distribution $\nu'$, and let $\hat\nu'_n = (\delta_{X'_1} + \cdots + \delta_{X'_n})/n$.
By Sanov's theorem, for a small neighborhood $\mathcal{V}_a$ of $\nu_a$, the probability that
$\hat\nu'_n$ belongs to $\mathcal{V}_a$ is such that

\[
-\frac{1}{n} \log \mathbb{P}\bigl\{ \hat\nu'_n \in \mathcal{V}_a \bigr\}
\xrightarrow[n \to \infty]{} \inf_{\nu \in \mathcal{V}_a} \mathrm{KL}(\nu, \nu')
\approx \mathrm{KL}(\nu_a, \nu') \geq \mathcal{K}_{\inf}(\nu_a, \mu^*) \,.
\]

In the limit, ignoring the sub-exponential terms, this means that for
$n = \log(T)/\mathcal{K}_{\inf}(\nu_a, \mu^*)$, the probability $\mathbb{P}\{\hat\nu'_n \in \mathcal{V}_a\}$ is smaller than $1/T$. Hence,
$\log(T)/\mathcal{K}_{\inf}(\nu_a, \mu^*)$ appears as the minimal number $n$ of draws ensuring
that the probability under any distribution with expectation at least $\mu^*$ of
the event "the empirical distribution of $n$ independent draws belongs to
a neighborhood of $\nu_a$" is smaller than $1/T$. This event, of course, has an
overwhelming probability under $\nu_a$. The significance of $1/T$ as a cutoff value
can be understood as follows: if the suboptimal arm $a$ is chosen along the
$T$ draws, then the regret is at most equal to $(\mu^* - \mu_a)T$; thus, keeping the
probability of this event under $1/T$ bounds the contribution of this event to
the average regret by a constant. Incidentally, this explains why knowing $\mu^*$
in advance does not significantly reduce the number of necessary suboptimal
draws. The analysis that follows shows that the bandit problem, despite its
sequential aspect and the absence of prior knowledge on the expectation of
the arms, is indeed comparable to a sequence of tests of level $1 - 1/T$ with
null hypothesis $H_0 : \mathrm{E}(\nu') > \mu^*$ and alternative hypothesis $H_1 : \nu' = \nu_a$, for
which Stein's lemma (see, e.g., van der Vaart 2000, Theorem 16.12) states
that the best error exponent is $\mathcal{K}_{\inf}(\nu_a, \mu^*)$.
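
The heuristic above can be checked numerically in a simple case. The Python sketch below makes several simplifying assumptions that are ours: arms are Bernoulli, $\nu'$ is the Bernoulli distribution with mean exactly $\mu^*$, $\mathcal{K}_{\inf}(\nu_a, \mu^*)$ is taken to be the Bernoulli divergence between $\mu_a$ and $\mu^*$, and the event "the empirical distribution is close to $\nu_a$" is replaced by the one-sided event "the empirical mean is at most $\mu_a$". The numerical values are illustrative.

```python
import math


def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def binomial_lower_tail(n, q, k_max):
    """Exact probability that a Binomial(n, q) variable is at most k_max."""
    return sum(math.comb(n, k) * q**k * (1 - q) ** (n - k) for k in range(k_max + 1))


T = 10_000
mu_a, mu_star = 0.3, 0.5

# Number of draws suggested by the heuristic, rounded up.
n = math.ceil(math.log(T) / kl_bernoulli(mu_a, mu_star))

# Probability, under Bernoulli(mu_star), that n draws have empirical mean <= mu_a.
tail = binomial_lower_tail(n, mu_star, math.floor(mu_a * n))
print(n, tail, 1.0 / T)
```

By the Chernoff bound the computed tail probability cannot exceed $1/T$ for this choice of $n$, in line with the interpretation of $1/T$ as a cutoff value.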

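Let us now turn to the main lines of the regret proof. By definition of
the algorithm, at rounds $t \geq K$, one has $A_{t+1} = a$ only if $U_a(t) \geq U_{a^*}(t)$.
Therefore, one has the decomposition

\[
\begin{aligned}
\{A_{t+1} = a\} &\subseteq \bigl\{ \mu^\dagger \geq U_{a^*}(t) \bigr\} \cup \bigl\{ \mu^\dagger < U_{a^*}(t) \text{ and } A_{t+1} = a \bigr\} \\
&\subseteq \bigl\{ \mu^\dagger \geq U_{a^*}(t) \bigr\} \cup \bigl\{ \mu^\dagger < U_a(t) \text{ and } A_{t+1} = a \bigr\} \,,
\end{aligned} \tag{5}
\]

where $\mu^\dagger$ is a parameter which is taken either equal to $\mu^*$, or slightly smaller
when required by technical arguments. The event $\{\mu^\dagger < U_a(t)\}$ can be
rewritten as

\[
\begin{aligned}
\bigl\{ \mu^\dagger < U_a(t) \bigr\}
&= \Bigl\{ \exists\, \nu' \in \mathcal{D} : \ \mathrm{E}(\nu') > \mu^\dagger \text{ and } \mathrm{KL}\bigl( \Pi_{\mathcal{D}}(\hat\nu_a(t)), \nu' \bigr) \leq \frac{f(t)}{N_a(t)} \Bigr\} \\
&= \bigl\{ \hat\nu_a(t) \in \mathcal{C}_{\mu^\dagger,\, f(t)/N_a(t)} \bigr\}
 = \bigl\{ \hat\nu_{a, N_a(t)} \in \mathcal{C}_{\mu^\dagger,\, f(t)/N_a(t)} \bigr\} \,,
\end{aligned}
\]

where for $\mu \in \mathbb{R}$ and $\gamma > 0$, the set $\mathcal{C}_{\mu,\gamma}$ is defined as

\[
\mathcal{C}_{\mu,\gamma} = \bigl\{ \nu \in \mathfrak{M}_1(S) : \ \exists\, \nu' \in \mathcal{D} \text{ with } \mathrm{E}(\nu') > \mu \text{ and } \mathrm{KL}\bigl( \Pi_{\mathcal{D}}(\nu), \nu' \bigr) \leq \gamma \bigr\} \,. \tag{6}
\]

By definition of $\mathcal{K}_{\inf}$,

\[
\mathcal{C}_{\mu,\gamma} \subseteq \bigl\{ \nu \in \mathfrak{M}_1(S) : \ \mathcal{K}_{\inf}\bigl( \Pi_{\mathcal{D}}(\nu), \mu \bigr) \leq \gamma \bigr\} \,. \tag{7}
\]

Using (5), and recalling that for rounds $t \in \{1, \ldots, K\}$, each arm is played
once, one obtains

\[
\mathbb{E}\bigl[ N_a(T) \bigr] \leq 1 + \sum_{t=K}^{T-1} \mathbb{P}\bigl\{ \mu^\dagger \geq U_{a^*}(t) \bigr\}
+ \sum_{t=K}^{T-1} \mathbb{P}\bigl\{ \hat\nu_{a, N_a(t)} \in \mathcal{C}_{\mu^\dagger,\, f(t)/N_a(t)} \text{ and } A_{t+1} = a \bigr\} \,.
\]

The two sums in this decomposition are handled separately. The first sum
is negligible with respect to the second sum: case-specific arguments, given
in Sections 4 and 5, prove the following statement.

Fact to be proven 1. For proper choices of $\Pi_{\mathcal{D}}$, $f$, and $\mu^\dagger$, the sum
$\sum \mathbb{P}\{\mu^\dagger \geq U_{a^*}(t)\}$ is negligible with respect to $\log T$.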
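The second sum is thus the leading term in the bound. It is first rewritten
using the stopping times $\tau_{a,2}, \tau_{a,3}, \ldots$ introduced in Section 2. Indeed,
$A_{t+1} = a$ happens for $t \geq K$ if and only if $\tau_{a,n} = t + 1$ for some
$n \in \{2, \ldots, T - K + 1\}$; and of course, two stopping times $\tau_{a,n}$ and $\tau_{a,n'}$ cannot
be equal when $n \neq n'$. We also note that $N_a(\tau_{a,n} - 1) = n - 1$ for $n \geq 2$.
Therefore,

\[
\begin{aligned}
\sum_{t=K}^{T-1} & \mathbb{I}\bigl\{ \hat\nu_{a, N_a(t)} \in \mathcal{C}_{\mu^\dagger,\, f(t)/N_a(t)} \text{ and } A_{t+1} = a \bigr\} \\
&\leq \sum_{t=K}^{T-1} \mathbb{I}\bigl\{ \hat\nu_{a, N_a(t)} \in \mathcal{C}_{\mu^\dagger,\, f(T)/N_a(t)} \text{ and } A_{t+1} = a \bigr\} \\
&= \sum_{t=K}^{T-1} \sum_{n=2}^{T-K+1} \mathbb{I}\bigl\{ \hat\nu_{a, N_a(t)} \in \mathcal{C}_{\mu^\dagger,\, f(T)/N_a(t)} \text{ and } \tau_{a,n} = t + 1 \bigr\} \\
&= \sum_{n=2}^{T-K+1} \sum_{t=K}^{T-1} \mathbb{I}\bigl\{ \hat\nu_{a, n-1} \in \mathcal{C}_{\mu^\dagger,\, f(T)/(n-1)} \text{ and } \tau_{a,n} = t + 1 \bigr\} \\
&\leq \sum_{n=1}^{T-K} \mathbb{I}\bigl\{ \hat\nu_{a,n} \in \mathcal{C}_{\mu^\dagger,\, f(T)/n} \bigr\} \,,
\end{aligned} \tag{8}
\]

where we used, successively, the following facts: the sets $\mathcal{C}_{\mu^\dagger,\gamma}$ grow with
$\gamma$; the event $\{A_{t+1} = a\}$ can be written as a disjoint union of the events
$\{\tau_{a,n} = t + 1\}$, for $2 \leq n \leq T - K + 1$; the events $\{\tau_{a,n} = t + 1\}$ are disjoint
as $t$ varies between $K$ and $T - 1$, with a possibly empty union (as $\tau_{a,n}$ may
be larger than $T$).

By upper bounding the first

\[
n_0 = \biggl\lceil \frac{f(T)}{\mathcal{K}_{\inf}(\nu_a, \mu^*)} \biggr\rceil \tag{9}
\]

terms of the sum in (8) by 1, we obtain

\[
\sum_{n=1}^{T-K} \mathbb{P}\bigl\{ \hat\nu_{a,n} \in \mathcal{C}_{\mu^\dagger,\, f(T)/n} \bigr\}
\leq \frac{f(T)}{\mathcal{K}_{\inf}(\nu_a, \mu^*)} + 1 + \sum_{n \geq n_0 + 1} \mathbb{P}\bigl\{ \hat\nu_{a,n} \in \mathcal{C}_{\mu^\dagger,\, f(T)/n} \bigr\} \,.
\]

It remains to upper bound the remaining sum: this is the object of the
following statement, which will also be proved using case-specific arguments.

Fact to be proven 2. For proper choices of $\Pi_{\mathcal{D}}$, $f$, and $\mu^\dagger$, the sum
$\sum \mathbb{P}\{\hat\nu_{a,n} \in \mathcal{C}_{\mu^\dagger,\, f(T)/n}\}$ is negligible with respect to $\log T$.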
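Putting everything together, one obtains

\[
\mathbb{E}\bigl[ N_a(T) \bigr] \leq \frac{f(T)}{\mathcal{K}_{\inf}(\nu_a, \mu^*)}
+ \underbrace{\sum_{n \geq n_0 + 1} \mathbb{P}\bigl\{ \hat\nu_{a,n} \in \mathcal{C}_{\mu^\dagger,\, f(T)/n} \bigr\}}_{o(\log T)}
+ \underbrace{\sum_{t=K}^{T-1} \mathbb{P}\bigl\{ \mu^\dagger \geq U_{a^*}(t) \bigr\}}_{o(\log T)}
+ 2 \,. \tag{10}
\]

Theorems 1 and 2 are instances of this general bound providing non-asymptotic
controls for $\mathbb{E}[N_a(T)]$ in the two settings considered in this paper.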

4. Rewards in a canonical one-dimensional exponential family. 

We consider in this section the case when $\mathcal{D}$ is a canonical exponential family
of probability distributions $\nu_\theta$, indexed by $\theta \in \Theta$; that is, the distributions
$\nu_\theta$ are absolutely continuous with respect to a dominating measure $\rho$ on $\mathbb{R}$,
with probability density

\[
\frac{d\nu_\theta}{d\rho}(x) = \exp\bigl( x\theta - b(\theta) \bigr) \,, \qquad x \in \mathbb{R} \,;
\]

we assume in addition that $b : \Theta \to \mathbb{R}$ is twice differentiable. We also assume
that $\Theta \subseteq \mathbb{R}$ is the natural parameter space, that is, the set

\[
\Theta = \Bigl\{ \theta \in \mathbb{R} : \int_{\mathbb{R}} \exp(x\theta) \, d\rho(x) < \infty \Bigr\} \,,
\]

and that the exponential family $\mathcal{D}$ is regular, i.e., that $\Theta$ is an open interval
(an assumption that turns out to be true in all the examples listed below).
For a thorough introduction to canonical exponential families, as well as
proofs of the following properties, the reader is referred to Lehmann and
Casella (1998).

The derivative $\dot b$ of $b$ is an increasing continuous function such that
$\mathrm{E}(\nu_\theta) = \dot b(\theta)$ for all $\theta \in \Theta$; in particular, $b$ is strictly convex. Thus, $\dot b$ is one-to-one
with a continuous inverse $\dot b^{-1}$ and the distributions $\nu_\theta$ of $\mathcal{D}$ can also be
parameterized by their expectations $\mathrm{E}(\nu_\theta)$. Defining the open interval of all
expectations, $I = \dot b(\Theta) = (\mu_-, \mu_+)$, there exists a unique distribution of $\mathcal{D}$
with expectation $\mu \in I$, namely, $\nu_{\dot b^{-1}(\mu)}$.

The Kullback-Leibler divergence between two distributions $\nu_\theta, \nu_{\theta'} \in \mathcal{D}$ is
given by

\[
\mathrm{KL}(\nu_\theta, \nu_{\theta'}) = (\theta - \theta') \, \dot b(\theta) - b(\theta) + b(\theta') \,,
\]

which, writing $\mu = \mathrm{E}(\nu_\theta)$ and $\mu' = \mathrm{E}(\nu_{\theta'})$, can be reformulated as

\[
d(\mu, \mu') = \bigl( \dot b^{-1}(\mu) - \dot b^{-1}(\mu') \bigr) \mu - b\bigl( \dot b^{-1}(\mu) \bigr) + b\bigl( \dot b^{-1}(\mu') \bigr) \,. \tag{11}
\]

This defines a divergence $d : I \times I \to \mathbb{R}_+$ that inherits from the Kullback-Leibler
divergence the property that $d(\mu, \mu') = 0$ if and only if $\mu = \mu'$. In
addition, $d$ is (strictly) convex and differentiable over $I \times I$.

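As the examples below of specific canonical exponential families illustrate,
the closed-form expression for this re-parameterized Kullback-Leibler
divergence is usually simple.

Example 1 (Binomial distributions for $n$-samples). $\theta = \log\bigl(\mu/(n - \mu)\bigr)$,
$\Theta = \mathbb{R}$, $b(\theta) = n \log(1 + \exp(\theta))$, $I = (0, n)$,

\[
d(\mu, \mu') = \mu \log\frac{\mu}{\mu'} + (n - \mu) \log\frac{n - \mu}{n - \mu'} \,.
\]

The case $n = 1$ corresponds to Bernoulli distributions.

Example 2 (Poisson distributions). $\theta = \log(\mu)$, $\Theta = \mathbb{R}$, $b(\theta) = \exp(\theta)$,
$I = (0, +\infty)$,

\[
d(\mu, \mu') = \mu' - \mu + \mu \log\frac{\mu}{\mu'} \,.
\]

Example 3 (Negative binomial distributions with known shape parameter $r$).
$\theta = \log\bigl(\mu/(r + \mu)\bigr)$, $\Theta = (-\infty, 0)$, $b(\theta) = -r \log(1 - \exp(\theta))$,
$I = (0, +\infty)$,

\[
d(\mu, \mu') = r \log\frac{r + \mu'}{r + \mu} + \mu \log\frac{\mu \, (r + \mu')}{\mu' (r + \mu)} \,.
\]

The case $r = 1$ corresponds to geometric distributions.

Example 4 (Gaussian distributions with known variance $\sigma^2$). $\theta = \mu/\sigma^2$,
$\Theta = \mathbb{R}$, $b(\theta) = \sigma^2 \theta^2 / 2$, $I = \mathbb{R}$,

\[
d(\mu, \mu') = \frac{(\mu - \mu')^2}{2\sigma^2} \,.
\]

Example 5 (Gamma distributions with known shape parameter $\alpha$).
$\theta = -\alpha/\mu$, $\Theta = (-\infty, 0)$, $b(\theta) = -\alpha \log(-\theta)$, $I = (0, +\infty)$,

\[
d(\mu, \mu') = \alpha \Bigl( \frac{\mu}{\mu'} - 1 - \log\frac{\mu}{\mu'} \Bigr) \,.
\]

The case $\alpha = 1$ corresponds to exponential distributions.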
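
The closed-form divergences of Examples 1 to 5 translate directly into code. The Python sketch below is an illustration only: the function names are ours, arguments are assumed to lie in the open interval $I$ (boundary values are not handled), and `kl_natural` evaluates the Kullback-Leibler divergence from the natural parameters so that the closed forms can be cross-checked against it.

```python
import math


def d_binomial(mu, mu2, n=1):
    """Example 1 (n = 1 gives the Bernoulli divergence); 0 < mu, mu2 < n."""
    return mu * math.log(mu / mu2) + (n - mu) * math.log((n - mu) / (n - mu2))


def d_poisson(mu, mu2):
    """Example 2; mu, mu2 > 0."""
    return mu2 - mu + mu * math.log(mu / mu2)


def d_negative_binomial(mu, mu2, r=1.0):
    """Example 3 (r = 1 gives geometric distributions); mu, mu2 > 0."""
    return (r * math.log((r + mu2) / (r + mu))
            + mu * math.log(mu * (r + mu2) / (mu2 * (r + mu))))


def d_gaussian(mu, mu2, sigma2=1.0):
    """Example 4, with known variance sigma2."""
    return (mu - mu2) ** 2 / (2.0 * sigma2)


def d_gamma(mu, mu2, alpha=1.0):
    """Example 5 (alpha = 1 gives exponential distributions); mu, mu2 > 0."""
    return alpha * (mu / mu2 - 1.0 - math.log(mu / mu2))


def kl_natural(theta, theta2, b, b_dot):
    """KL(nu_theta, nu_theta2) = (theta - theta2) * b'(theta) - b(theta) + b(theta2)."""
    return (theta - theta2) * b_dot(theta) - b(theta) + b(theta2)


# Cross-check of the Poisson closed form against the natural-parameter formula.
mu, mu2 = 2.0, 3.0
print(d_poisson(mu, mu2), kl_natural(math.log(mu), math.log(mu2), math.exp, math.exp))
```

Each function vanishes when its two arguments coincide and is positive otherwise, as expected of a divergence.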
For all $\mu \in I$ the convex functions $d(\,\cdot\,, \mu)$ and $d(\mu, \,\cdot\,)$ can be extended by
continuity to $\bar I = [\mu_-, \mu_+]$ as follows:

\[
d(\mu_-, \mu) = \lim_{\mu' \to \mu_-} d(\mu', \mu) \,, \qquad d(\mu_+, \mu) = \lim_{\mu' \to \mu_+} d(\mu', \mu) \,,
\]

with similar statements for the second function. Note that these limits may
equal $+\infty$; the extended function $d : \bar I \times I \,\cup\, I \times \bar I \to [0, +\infty]$ is still a convex
function. By convention, we also define $d(\mu_-, \mu_-) = d(\mu_+, \mu_+) = 0$.

Note that our exponential family models are minimal in the sense of
Wainwright and Jordan (2008, Section 3.2) and thus that $I$ coincides with
the interior of the set of realizable expectations for all distributions that are
absolutely continuous with respect to $\rho$ (see Wainwright and Jordan 2008,
Theorem 3.3 and Appendix B). In particular, this implies that distributions
in $\mathcal{D}$ have supports in $\bar I$ and that, consequently, the empirical means $\hat\mu_a(t)$
are in $\bar I$ for all $a$ and $t$. (Note however that they may not be in $I$ itself: think
in particular of the case of Bernoulli distributions when $t$ is small.)

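4.1. The kl-UCB algorithm. As the distributions in $\mathcal{D}$ can be parameterized
by their expectation, $\Pi_{\mathcal{D}}$ associates with each $\nu \in \mathfrak{M}_1(\bar I)$ such that
$\mathrm{E}(\nu) \in I$ the distribution $\nu_{\dot b^{-1}(\mathrm{E}(\nu))} \in \mathcal{D}$, which has the same expectation.

As shown above, for all $\nu' \in \mathcal{D}$ it then holds that $\mathrm{KL}\bigl(\Pi_{\mathcal{D}}(\nu), \nu'\bigr) =
d\bigl(\mathrm{E}(\nu), \mathrm{E}(\nu')\bigr)$; and this equality can be extended to the case where $\mathrm{E}(\nu) \in \bar I$.
In this setting, sufficient statistics for $\hat\nu_a(t)$ and $\hat\nu_{a,n}$ are given by, respectively,

\[
\hat\mu_a(t) = \frac{1}{N_a(t)} \sum_{s=1}^{t} Y_s \, \mathbb{I}_{\{A_s = a\}}
\qquad \text{and} \qquad
\hat\mu_{a,n} = \frac{1}{n} \sum_{k=1}^{n} X_{a,k} \,,
\]

where the former is defined as soon as $N_a(t) \geq 1$.

The upper-confidence bound $U_a(t)$ may be defined in this model not only
in terms of $\mathcal{D}$ but also of its "boundaries," namely, in terms of $\bar I$ and not
only $I$, as

\[
U_a(t) = \sup\Bigl\{ \mu \in \bar I : \ d\bigl( \hat\mu_a(t), \mu \bigr) \leq \frac{f(t)}{N_a(t)} \Bigr\} \,. \tag{12}
\]

This supremum is achieved: in the case when $\hat\mu_a(t) \in I$, this follows from
the fact that $d$ is continuous on $I \times \bar I$; when $\hat\mu_a(t) = \mu_+$, this is because
$U_a(t) = \mu_+$; in the case when $\hat\mu_a(t) = \mu_-$, either $\mu_-$ is the only $\mu \in \bar I$ for
which $d(\mu_-, \mu)$ is finite, or $d(\mu_-, \,\cdot\,)$ is convex thus continuous on the open
interval where it is finite.

Thus, in the setting of this section, Algorithm 1 rewrites as Algorithm 2
below, which will be referred to as kl-UCB.
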
Algorithm 2: The kl-UCB algorithm.

Parameters: A non-decreasing function $f : \mathbb{N} \to \mathbb{R}$
Initialization: Pull each arm of $\{1, \ldots, K\}$ once

for $t = K$ to $T - 1$, do

compute for each arm $a$ the quantity

\[
U_a(t) = \sup\Bigl\{ \mu \in \bar I : \ d\bigl( \hat\mu_a(t), \mu \bigr) \leq \frac{f(t)}{N_a(t)} \Bigr\}
\]

pick an arm $A_{t+1} \in \operatorname*{argmax}_{a \in \{1, \ldots, K\}} U_a(t)$


In practice, the computation of Ua{t) boils down to finding the zero of an 
increasing and convex scalar function. This can be done either by dichotomic 
search or by Newton iterations. In all the examples given above, well-known 
inequalities (e.g., Hoeffding's inequality) may be used to obtain an initial 
upper bound on Ua{t). 
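
As an illustration of the dichotomic search mentioned above, the following Python sketch computes the kl-UCB index for Bernoulli rewards. It is a minimal sketch under stated assumptions rather than the authors' implementation: the exploration function is the one of Theorem 1 below, the initial upper bound comes from Pinsker's inequality $d(p, q) \geq 2(p - q)^2$, and the clipping level `eps`, the tolerance `tol` and the function names are our own choices.

```python
import math


def kl_bernoulli(p, q, eps=1e-15):
    """Bernoulli divergence d(p, q); arguments are clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def exploration_rate(t, c=3.0):
    """f(t) = log(t) + c * log(log(t)) for t >= 3, with f(1) = f(2) = f(3)."""
    t = max(t, 3)
    return math.log(t) + c * math.log(math.log(t))


def klucb_bernoulli_index(mean, n_pulls, t, tol=1e-9):
    """Largest mu in [mean, 1] such that d(mean, mu) <= f(t) / n_pulls."""
    level = exploration_rate(t) / n_pulls
    lo = mean
    # Pinsker's inequality gives an initial upper bound on the index.
    hi = min(1.0, mean + math.sqrt(level / 2.0))
    # mu -> d(mean, mu) is increasing on [mean, 1): bisect on the constraint.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo


print(klucb_bernoulli_index(mean=0.3, n_pulls=50, t=1000))
```

Returning the lower end `lo` of the final bracket keeps the constraint satisfied up to the tolerance; Newton iterations could replace the bisection when faster convergence is needed.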

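4.2. Regret analysis. In this parametric context we have $\mathcal{K}_{\inf}(\nu, \mu) =
d\bigl(\mathrm{E}(\nu), \mu\bigr)$ when $\mathrm{E}(\nu) \in \bar I$ and $\mu \in I$. The following theorem thus proves,
in light of the lower bound of Lai and Robbins (1985), the asymptotic
optimality of the kl-UCB algorithm. Moreover, it provides an explicit,
non-asymptotic bound on the regret.

Theorem 1. Assume that all arms belong to a canonical, regular,
exponential family $\mathcal{D} = \{\nu_\theta : \theta \in \Theta\}$ of probability distributions indexed by its
natural parameter space $\Theta \subseteq \mathbb{R}$. Then, using Algorithm 2 with the divergence
$d$ given in (11) and with the choice $f(t) = \log(t) + 3 \log\log(t)$ for $t \geq 3$ and
$f(1) = f(2) = f(3)$, the number of draws of any suboptimal arm $a$ is upper
bounded for any horizon $T \geq 3$ as

\[
\begin{aligned}
\mathbb{E}\bigl[ N_a(T) \bigr] \leq{} & \frac{\log(T)}{d(\mu_a, \mu^*)}
+ 2 \sqrt{ \frac{2\pi \sigma^2_{a,*} \bigl( d'(\mu_a, \mu^*) \bigr)^2}{\bigl( d(\mu_a, \mu^*) \bigr)^3} } \, \sqrt{ \log(T) + 3 \log(\log(T)) } \\
& + \Bigl( 4e + \frac{3}{d(\mu_a, \mu^*)} \Bigr) \log(\log(T))
+ 8 \sigma^2_{a,*} \Bigl( \frac{d'(\mu_a, \mu^*)}{d(\mu_a, \mu^*)} \Bigr)^2 + 6 \,,
\end{aligned}
\]

where $\sigma^2_{a,*} = \max\bigl\{ \mathrm{Var}(\nu_\theta) : \mu_a \leq \mathrm{E}(\nu_\theta) \leq \mu^* \bigr\}$ and where $d'(\,\cdot\,, \mu^*)$ denotes
the derivative of $d(\,\cdot\,, \mu^*)$.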


imsart-aos ver. 2012/08/31 file: klucb-HAL.tex date: October 5, 2012 




5. Bounded and finitely supported rewards. In this section, $\mathcal{D}$ is
the set $\mathcal{F}$ of finitely supported probability distributions over $S = [0, 1]$. In
this case, the empirical measures $\hat\nu_a(t)$ belong to $\mathcal{F}$ and hence the operator
$\Pi_{\mathcal{D}}$ is taken to be the identity. We denote by $\mathrm{Supp}(\nu)$ the finite support of
an element $\nu \in \mathcal{F}$.

The maximization program (4) defining $U_a(t)$ admits in this case the
simpler formulation

\[
U_a(t) = \sup\Bigl\{ \mathrm{E}(\nu) : \ \nu \in \mathfrak{M}_1\bigl( \mathrm{Supp}(\hat\nu_a(t)) \cup \{1\} \bigr) \text{ and } \mathrm{KL}\bigl( \hat\nu_a(t), \nu \bigr) \leq \frac{f(t)}{N_a(t)} \Bigr\}
\]

and also admits an explicit computational solution; these two points are
detailed in Appendix C.1. Thus Algorithm 1 takes the following simpler
form, which will be referred to as the empirical KL-UCB algorithm.

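Algorithm 3: The empirical KL-UCB algorithm.

Parameters: A non-decreasing function $f : \mathbb{N} \to (0, +\infty)$
Initialization: Pull each arm of $\{1, \ldots, K\}$ once

for $t = K$ to $T - 1$, do

compute for each arm $a$ the quantity

\[
U_a(t) = \sup\Bigl\{ \mathrm{E}(\nu) : \ \nu \in \mathfrak{M}_1\bigl( \mathrm{Supp}(\hat\nu_a(t)) \cup \{1\} \bigr) \text{ and } \mathrm{KL}\bigl( \hat\nu_a(t), \nu \bigr) \leq \frac{f(t)}{N_a(t)} \Bigr\}
\]

pick an arm $A_{t+1} \in \operatorname*{argmax}_{a \in \{1, \ldots, K\}} U_a(t)$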



Like the DMED algorithm, for which asymptotic bounds are proved in Honda 
and Takemura (2010), Algorithm 1 relies on the empirical likelihood method 
(see Owen 2001) for the construction of the confidence bounds. However, 
DMED is not an index policy, but it maintains a list of active arms — an ap- 
proach that, generally speaking, seems to be less satisfactory and slightly 
less efficient in practice. Besides, the analyses of the two algorithms, even 
though they both rely on some technical properties of the function $\mathcal{K}_{\inf}$,
differ significantly. 

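Theorem 2. Assume that $\mu_a > 0$ for all arms $a$ and that $\mu^* < 1$. There
exists a constant $M(\nu_a, \mu^*) > 0$ only depending on $\nu_a$ and $\mu^*$ such that, with
the choice $f(t) = \log(t) + \log(\log(t))$ for $t \geq 2$, the expected number of times
that any suboptimal arm $a$ is pulled by Algorithm 3 is smaller, for all $T \geq 3$,
than

\[
\begin{aligned}
& \frac{\log(T)}{\mathcal{K}_{\inf}(\nu_a, \mu^*)}
+ \frac{36}{(\mu^*)^4} \bigl( \log(T) \bigr)^{4/5} \log(\log(T))
+ \Bigl( \frac{72}{(\mu^*)^4} + \frac{2\mu^*}{(1 - \mu^*) \, \mathcal{K}_{\inf}(\nu_a, \mu^*)^2} \Bigr) \bigl( \log(T) \bigr)^{4/5} \\
& \quad + \frac{(1 - \mu^*)^2 \, M(\nu_a, \mu^*)}{2 (\mu^*)^2} \bigl( \log(T) \bigr)^{2/5}
+ \frac{\log(\log(T))}{\mathcal{K}_{\inf}(\nu_a, \mu^*)}
+ \frac{2\mu^*}{(1 - \mu^*) \, \mathcal{K}_{\inf}(\nu_a, \mu^*)^2} + 4 \,.
\end{aligned}
\]

Theorem 2 implies a non-asymptotic bound of the form

\[
\mathbb{E}\bigl[ N_a(T) \bigr] \leq \frac{\log(T)}{\mathcal{K}_{\inf}(\nu_a, \mu^*)} + O\Bigl( \bigl( \log(T) \bigr)^{4/5} \log(\log(T)) \Bigr) \,.
\]

The exact value of the constant $M(\nu_a, \mu^*)$ is provided in the proof of
Theorem 2 (see Section B.3) and relies on the variational form of $\mathcal{K}_{\inf}$ introduced
in Section B.1 (see Lemma 4).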

6. Algorithms for general bounded rewards. In this section, we 
consider the case where the arms are only known to have bounded distribu- 
tions. As in Section 5, we assume without loss of generality that the rewards 
are bounded in [0,1]. This is the setting considered by Auer, Cesa-Bianchi 
and Fischer (2002), where the UCB algorithm was described and analyzed. 
We first prove that kl-UCB (Algorithm 2) with Kullback-Leibler divergence 
for Bernoulli distributions is always preferable to UCB, in the sense that a 
smaller finite-time regret bound is guaranteed. UCB is indeed nothing but 
kl-UCB with quadratic divergence and we obtain a refined analysis of UCB 
as a consequence of Theorem 1. We then discuss the use of the empirical 
KL-UCB approach, in which one directly applies Algorithm 3. We provide pre- 
liminary results to support the observation that empirical KL-UCB achieves 
improved performance on sufficiently long horizons (see simulation results 
in Section 7), at the price however of a significantly higher computational 
complexity. 

6.1. The kl-UCB algorithm for bounded distributions. A careful reading
of the proof of Theorem 1 (see Section A) shows that kl-UCB enjoys regret
guarantees in models with arbitrary bounded distributions $\nu$ over $[0,1]$ as
long as it is used with a divergence $d$ over $[0,1]^2$ satisfying the following
double property: There exists a family of strictly convex and continuously
differentiable functions $\phi_\mu : \mathbb{R} \to [0,+\infty)$, indexed by
$\mu \in [0,1]$, such that first, $d(\,\cdot\,,\mu)$ is the convex conjugate of
$\phi_\mu$ for all $\mu \in [0,1]$; and, second, the domination condition
$\mathcal{L}_\nu(\lambda) \le \phi_{\mathrm{E}(\nu)}(\lambda)$ for all
$\lambda \in \mathbb{R}$ and all $\nu \in \mathfrak{M}_1([0,1])$ holds, where
$\mathcal{L}_\nu$ denotes the moment-generating function of $\nu$,

$$ \mathcal{L}_\nu : \lambda \in \mathbb{R} \longmapsto
   \mathcal{L}_\nu(\lambda) = \int_{[0,1]} e^{\lambda x}\,\mathrm{d}\nu(x) . $$

The following elementary lemma dates back to Hoeffding (1963); it upper
bounds the moment-generating function of any probability distribution over
$[0,1]$ with expectation $\mu$ by the moment-generating function of the
Bernoulli distribution with parameter $\mu$, which is further bounded by the
moment-generating function of the normal distribution with mean $\mu$ and
variance 1/4. All these moment-generating functions are defined on the whole
real line $\mathbb{R}$. In light of the above, it thus shows that the
Kullback-Leibler divergence $d_{\mathrm{BER}}$ between Bernoulli distributions
and the Kullback-Leibler divergence $d_{\mathrm{QUAD}}$ between normal
distributions with variance 1/4 are adequate candidates for use in the kl-UCB
algorithm in the case of bounded distributions.

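Lemma 1. Let $\nu \in \mathfrak{M}_1([0,1])$ and let $\mu = \mathrm{E}(\nu)$.
Then, for all $\lambda \in \mathbb{R}$,

$$ \mathcal{L}_\nu(\lambda) = \int_{[0,1]} e^{\lambda x}\,\mathrm{d}\nu(x)
   \;\le\; 1 - \mu + \mu\exp(\lambda)
   \;\le\; \exp\bigl(\lambda\mu + \lambda^2/8\bigr) . $$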

The proof of this lemma is straightforward; the first inequality is by
convexity, as $e^{\lambda x} \le x e^{\lambda} + (1-x)$ for all $x \in [0,1]$,
the second inequality follows by standard analysis.
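As a quick sanity check of the chain of inequalities described above (any
moment-generating function on $[0,1]$ is dominated by the Bernoulli one, itself
dominated by the Gaussian one with variance 1/4), the following sketch evaluates
the three quantities for a small finitely supported distribution. The support,
the weights and the function names are illustrative assumptions, not part of
the paper.

```python
import math

def mgf(support, probs, lam):
    """Moment-generating function of a finitely supported distribution."""
    return sum(p * math.exp(lam * x) for x, p in zip(support, probs))

def hoeffding_chain(support, probs, lam):
    """Return (mgf of nu, Bernoulli(mu) mgf, Gaussian(mu, 1/4) mgf) at lam."""
    mu = sum(p * x for x, p in zip(support, probs))
    bernoulli = 1.0 - mu + mu * math.exp(lam)
    gaussian = math.exp(lam * mu + lam ** 2 / 8.0)
    return mgf(support, probs, lam), bernoulli, gaussian

# Hypothetical distribution on [0, 1], chosen only for illustration.
support = [0.0, 0.3, 0.8, 1.0]
probs = [0.1, 0.4, 0.3, 0.2]
for lam in (-5.0, -1.0, 0.0, 0.5, 3.0):
    left, middle, right = hoeffding_chain(support, probs, lam)
    print(f"lambda={lam:+.1f}: {left:.4f} <= {middle:.4f} <= {right:.4f}")
```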

We therefore have the following corollaries to Theorem 1. (They are obtained
by bounding in particular the variance term $\sigma^2_{a,\star}$ by 1/4.)

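Corollary 1. Consider a bandit problem with rewards bounded in $[0,1]$.
Choosing the parameters $f(t) = \log(t) + 3\log\log(t)$ for $t \ge 3$ and
$f(1) = f(2) = f(3)$, and

$$ d_{\mathrm{BER}}(\mu,\mu') = \mu\log\frac{\mu}{\mu'}
   + (1-\mu)\log\frac{1-\mu}{1-\mu'} $$

in Algorithm 2, the number of draws of any suboptimal arm $a$ is upper
bounded for any horizon $T \ge 3$ as

$$ \mathbb{E}[N_a(T)] \le \frac{\log(T)}{d_{\mathrm{BER}}(\mu_a,\mu^\star)}
   + \frac{\sqrt{2\pi}\,\log\!\Bigl(\frac{\mu^\star(1-\mu_a)}{\mu_a(1-\mu^\star)}\Bigr)}
          {\bigl(d_{\mathrm{BER}}(\mu_a,\mu^\star)\bigr)^{3/2}}
     \sqrt{\log(T) + 3\log(\log(T))} $$
$$ \qquad + \Bigl(4e + \frac{3}{d_{\mathrm{BER}}(\mu_a,\mu^\star)}\Bigr)\log(\log(T))
   + \frac{2\Bigl(\log\!\Bigl(\frac{\mu^\star(1-\mu_a)}{\mu_a(1-\mu^\star)}\Bigr)\Bigr)^2}
          {\bigl(d_{\mathrm{BER}}(\mu_a,\mu^\star)\bigr)^2} + 6 . $$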

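We denote by $\phi_{\mathrm{E}(\nu)} = 1 - \mathrm{E}(\nu) + \mathrm{E}(\nu)\exp(\,\cdot\,)$
the upper bound on $\mathcal{L}_\nu$ exhibited in Lemma 1. Standard results on
Kullback-Leibler divergences are that for all $\mu, \mu' \in [0,1]$ and all
$\nu, \nu' \in \mathfrak{M}_1([0,1])$,

$$ d_{\mathrm{BER}}(\mu,\mu') = \sup_{\lambda\in\mathbb{R}}\bigl\{\lambda\mu - \phi_{\mu'}(\lambda)\bigr\}
   \quad\text{and}\quad
   \mathrm{KL}(\nu,\nu') \ge \sup_{\lambda\in\mathbb{R}}\bigl\{\lambda\,\mathrm{E}(\nu) - \mathcal{L}_{\nu'}(\lambda)\bigr\} $$

(see Massart 2007, pages 21 and 28, see also Dembo and Zeitouni 1998).
Because of Lemma 1, it thus holds that for all distributions
$\nu, \nu' \in \mathfrak{M}_1([0,1])$,

$$ d_{\mathrm{BER}}\bigl(\mathrm{E}(\nu), \mathrm{E}(\nu')\bigr) \le \mathrm{KL}(\nu,\nu') , $$

and it follows that in the model $\mathcal{D} = \mathfrak{M}_1([0,1])$ one has

$$ \mathcal{K}_{\inf}(\nu_a,\mu^\star) \ge d_{\mathrm{BER}}(\mu_a,\mu^\star) . $$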

As expected, the kl-UCB algorithm may not be optimal for all sub-families 
of bounded distributions. Yet, this algorithm has stronger guarantees than 
the UCB algorithm. It is readily checked that the latter exactly corresponds
to the choice of

$$ d_{\mathrm{QUAD}}(\mu,\mu') = 2(\mu - \mu')^2 $$

in Algorithm 2. Indeed, the analysis derived in this paper gives an improved 
analysis of the performance of the UCB algorithm. 

Corollary 2. Consider the kl-UCB algorithm with $d_{\mathrm{QUAD}}$, or
equivalently, the UCB algorithm tuned as follows: at step $t+1 > K$, an arm
maximizing the upper-confidence bounds

$$ \hat\mu_a(t) + \sqrt{\bigl(\log(t) + 3\log\log(t)\bigr)/\bigl(2N_a(t)\bigr)} $$

is chosen. Then the number of draws of a suboptimal arm $a$ is upper bounded as

$$ \mathbb{E}[N_a(T)] \le \frac{\log(T)}{2(\mu^\star-\mu_a)^2}
   + \frac{2\sqrt{\pi}}{(\mu^\star-\mu_a)^2}\sqrt{\log(T) + 3\log(\log(T))}
   + \Bigl(4e + \frac{3}{2(\mu^\star-\mu_a)^2}\Bigr)\log(\log(T))
   + \frac{8}{(\mu^\star-\mu_a)^2} + 6 . $$

As claimed, it can be checked that the leading term in the bound of
Corollary 1 is smaller than the one of Corollary 2 by applying Pinsker's
inequality $d_{\mathrm{BER}} \ge d_{\mathrm{QUAD}}$. The bound obtained in
Corollary 2 above also improves on the one of Auer, Cesa-Bianchi and Fischer
(2002, Theorem 1) and it is "optimal" in the sense that the constant 1/2 in
the logarithmic term cannot be improved. Note that a constant in front of the
leading term of the regret bound is proven to be arbitrarily close to (but
strictly greater than) 1/2 for the UCB2 algorithm of Auer, Cesa-Bianchi and
Fischer (2002), when the parameter $\alpha$ goes to 0 as the horizon grows, but
then other terms are unbounded. In comparison, Corollary 2 provides a bound
for UCB with a leading optimal constant 1/2 and all the remaining terms of the
bound are finite and made explicit. Note, in addition, that the choice of the
parameter $\alpha$, which drives the length of the phases during which a single
arm is played, is important but difficult in practice, where UCB2 does not
really prove more efficient than UCB.
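The comparison between the two corollaries can be made concrete at the level of
the indices themselves. The sketch below is a minimal illustration, assuming
the index is the largest $q$ such that $N_a(t)\,d(\hat\mu_a(t), q) \le f(t)$; the
bisection routine, the clipping constant and the function names are our own
choices and not the authors' implementation. By Pinsker's inequality the
Bernoulli index can never exceed the quadratic (UCB) one.

```python
import math

def kl_bernoulli(p, q, eps=1e-15):
    """Binary Kullback-Leibler divergence, arguments clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_ucb_index(mean, pulls, level, iterations=60):
    """Largest q in [mean, 1] with pulls * kl_bernoulli(mean, q) <= level (bisection)."""
    low, high = mean, 1.0
    for _ in range(iterations):
        mid = 0.5 * (low + high)
        if pulls * kl_bernoulli(mean, mid) <= level:
            low = mid
        else:
            high = mid
    return low

def ucb_index(mean, pulls, level):
    """Closed-form index for the quadratic divergence d(p, q) = 2 (p - q)^2."""
    return mean + math.sqrt(level / (2.0 * pulls))

# Illustrative values: a low-mean arm pulled 50 times at round t = 1000.
t, pulls, mean = 1000, 50, 0.05
level = math.log(t) + 3.0 * math.log(math.log(t))
print("kl-UCB index:", kl_ucb_index(mean, pulls, level))
print("UCB index:   ", ucb_index(mean, pulls, level))
```

For low empirical means, as in the example, the gap between the two indices is
large, which is the mechanism behind the improvement discussed above.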

6.2. The empirical KL-UCB algorithm for bounded distributions. The
justification of the use of empirical KL-UCB for general bounded distributions
$\mathfrak{M}_1([0,1])$ relies on the following result.

A result of independent interest, connected to the empirical-likelihood method.
The empirical-likelihood (or EL in short) method provides a way to construct
confidence bounds for the true expectation of i.i.d. observations; for a
thorough introduction to this theory, see Owen (2001). We only recall briefly
its principle. Given a sample $X_1, \ldots, X_n$ of an unknown distribution
$\nu_0$, and denoting $\hat\nu_n = n^{-1}\sum_{k=1}^{n}\delta_{X_k}$ the
empirical distribution of this sample, an EL upper-confidence bound for the
expectation $\mathrm{E}(\nu_0)$ of $\nu_0$ is given by

(13) $$ U_{\mathrm{EL}}(\hat\nu_n,\epsilon) = \sup\bigl\{\mathrm{E}(\nu') :
   \nu' \in \mathfrak{M}_1\bigl(\mathrm{Supp}(\hat\nu_n)\bigr)
   \ \text{and}\ \mathrm{KL}(\hat\nu_n,\nu') \le \epsilon\bigr\} , $$

where $\epsilon > 0$ is a parameter controlling the confidence level.

An apparent impediment to the application of this method in bandit problems
is the impossibility of obtaining non-asymptotic guarantees for the covering
probability of EL upper-confidence bounds. In fact, it appears in (13) that
$U_{\mathrm{EL}}(\hat\nu_n,\epsilon)$ necessarily belongs to the convex
envelope of the observations. If, for example, all the observations are equal
to 0, then $U_{\mathrm{EL}}(\hat\nu_n,\epsilon)$ is also equal to 0, no matter
what the value of $\epsilon$ is; therefore, it is not possible to obtain
upper-confidence bounds for all confidence levels.

In the case of (upper-)bounded variables, this problem can be circumvented
by adding to the support of $\hat\nu_n$ the maximal possible value. In our
case, instead of considering $U_{\mathrm{EL}}(\hat\nu_n,\epsilon)$, one should use

(14) $$ U(\hat\nu_n,\epsilon) = \sup\bigl\{\mathrm{E}(\nu') :
   \nu' \in \mathfrak{M}_1\bigl(\mathrm{Supp}(\hat\nu_n)\cup\{1\}\bigr)
   \ \text{and}\ \mathrm{KL}(\hat\nu_n,\nu') \le \epsilon\bigr\} . $$


This idea was introduced in Honda and Takemura (2010), independently 
of the EL literature. The following guarantee can be obtained; its proof is 
provided in Section C.2. 

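Proposition 1. Let $\nu_0 \in \mathfrak{M}_1([0,1])$ with
$\mathrm{E}(\nu_0) \in (0,1)$ and let $X_1, \ldots, X_n$ be independent random
variables with common distribution $\nu_0 \in \mathfrak{M}_1([0,1])$, not
necessarily with finite support. Then, for all $\epsilon > 0$,

$$ \mathbb{P}\bigl\{U(\hat\nu_n,\epsilon) \le \mathrm{E}(\nu_0)\bigr\}
   \le \mathbb{P}\bigl\{\mathcal{K}_{\inf}\bigl(\hat\nu_n,\mathrm{E}(\nu_0)\bigr) \ge \epsilon\bigr\}
   \le e\,(n+2)\exp(-n\epsilon) , $$

where $\mathcal{K}_{\inf}$ is defined in terms of the model $\mathcal{D} = \mathcal{F}$.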

For $\{0,1\}$-valued observations, it is readily seen that
$U(\hat\nu_n,\epsilon)$ boils down to the upper-confidence bound given by (12).
This example suggests that the above proposition is not (always) optimal: the
presence of the factor $n$ in front of the exponential $\exp(-n\epsilon)$ term
is indeed questionable.
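To make the bound $U(\hat\nu_n,\epsilon)$ of (14) more tangible, here is a
minimal numerical sketch. It rests on two assumptions that should be checked
against the rest of the paper: that $U(\hat\nu_n,\epsilon)$ equals the largest
$\mu$ with $\mathcal{K}_{\inf}(\hat\nu_n,\mu) \le \epsilon$, and that
$\mathcal{K}_{\inf}$ can be evaluated through the variational formula of
Lemma 4 (Section B.1). The function names, the ternary search and the bisection
are our own illustrative choices, not the authors' implementation.

```python
import math

def k_inf(sample, mu, iterations=100):
    """max over lam in [0, 1] of the empirical mean of log(1 - lam (x - mu) / (1 - mu))."""
    if mu >= 1.0:
        return math.inf

    def objective(lam):
        total = 0.0
        for x in sample:
            arg = 1.0 - lam * (x - mu) / (1.0 - mu)
            if arg <= 0.0:
                return -math.inf
            total += math.log(arg)
        return total / len(sample)

    low, high = 0.0, 1.0
    for _ in range(iterations):  # ternary search, valid because objective is concave
        m1 = low + (high - low) / 3.0
        m2 = high - (high - low) / 3.0
        if objective(m1) < objective(m2):
            low = m1
        else:
            high = m2
    return max(objective(low), objective(high), 0.0)

def empirical_kl_upper_bound(sample, eps, iterations=60):
    """Bisection for the largest mu in [mean, 1] with k_inf(sample, mu) <= eps."""
    low, high = sum(sample) / len(sample), 1.0
    if low >= 1.0:
        return 1.0
    for _ in range(iterations):
        mid = 0.5 * (low + high)
        if k_inf(sample, mid) <= eps:
            low = mid
        else:
            high = mid
    return low

# All observations equal to 0: the bound is still informative, unlike (13).
print(empirical_kl_upper_bound([0.0] * 5, 0.5))
```

With all observations equal to 0 the optimal $\nu'$ moves mass $p$ to the added
point 1 at a cost $-\log(1-p)$, so the bound should be close to
$1 - e^{-\epsilon}$; for $\{0,1\}$-valued samples the result should agree with
the Bernoulli index.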

Conjectured regret guarantees of empirical KL-UCB. The analysis of empir- 
ical KL-UCB in the case where the arms are associated with general bounded 
distributions is a work in progress. In view of Proposition 1 and of the dis- 
cussion above, it is only the proof of Fact 2 that needs to be extended. 

As a preliminary result, we can prove an asymptotic regret bound, which
is indeed optimal, but for a variant of Algorithm 3; it consists of playing in
regimes $r$ of increasing lengths instances of the empirical KL-UCB algorithm
in which the upper confidence bounds are given by

$$ \sup\Bigl\{\mathrm{E}(\nu) : \nu \in \mathfrak{M}_1\bigl(\mathrm{Supp}(\hat\nu_a(t))\cup\{1+\delta_r\}\bigr)
   \ \text{and}\ \mathrm{KL}\bigl(\hat\nu_a(t),\nu\bigr) \le \frac{f(t)}{N_a(t)}\Bigr\} , $$

where $\delta_r \to 0$ as the index of the regime $r$ increases.

The open questions would be to get an optimal bound for Algorithm 3
itself, preferably a non-asymptotic one like those of Theorems 1 and 2. Also,
a computational issue arises: as the support of each empirical distribution
may contain as many points as the number of times the corresponding arm
was pulled, the computational complexity of the empirical KL-UCB algorithm
grows, approximately linearly, with the number of rounds. Hence the
empirical KL-UCB algorithm as it stands is only suitable for small to medium
horizons (typically less than ten thousand rounds). To reduce the numerical
complexity of this algorithm without sacrificing performance, a possible
direction could be to cluster the rewards on adaptive grids that are to be
refined over time.

7. Numerical experiments. The results of the previous sections show 
that the kl-UCB and the empirical KL-UCB algorithms are efficient not only 
in the special frameworks for which they were developed, but also for general 
bounded distributions. In the rest of this section, we support this claim by 
numerical experiments that compare these methods with competitors such 
as UCB and UCB-Tuned (Auer, Cesa-Bianchi and Fischer, 2002), MOSS (Au- 
dibert and Bubeck, 2010), UCB-V (Audibert, Munos and Szepesvari, 2009) 
or DMED (Honda and Takemura, 2011). In these simulations, similar
confidence levels are chosen for all the upper confidence bounds, corresponding
to $f(t) = \log(t)$, a choice which we recommend in practice. Indeed, using
$f(t) = \log(t) + 3\log\log(t)$ or $f(t) = (1+\epsilon)\log(t)$ (with a small
$\epsilon > 0$) yields similar conclusions regarding the ranking of the
performance of the algorithms, but leads to slightly higher average regrets.
More precisely, the upper-confidence bounds we used were
$U_a(t) = \hat\mu_a(t) + \sqrt{\log(t)/(2N_a(t))}$ for UCB,

(15) $$ U_a(t) = \hat\mu_a(t) + \sqrt{\frac{2\,\hat v_a(t)\log(t)}{N_a(t)}} + 3\,\frac{\log(t)}{N_a(t)} $$

with

$$ \hat v_a(t) = \Bigl(\frac{1}{N_a(t)}\sum_{s=1}^{t} Y_s^2\,\mathbb{1}\{A_s = a\}\Bigr) - \hat\mu_a(t)^2 $$

for UCB-V, and, following Auer, Cesa-Bianchi and Fischer (2002),

$$ U_a(t) = \hat\mu_a(t) + \sqrt{\frac{\min\bigl\{1/4,\ \hat v_a(t) + \sqrt{2\log(t)/N_a(t)}\bigr\}\,\log(t)}{N_a(t)}} $$

for UCB-Tuned. Both UCB-V and UCB-Tuned are expected to improve over
UCB by estimating the variance of the rewards; but UCB-Tuned was introduced
as a heuristic improvement over UCB (and does not come with a
performance bound) while UCB-V was analysed by Audibert, Munos and
Szepesvari (2009).
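The three competitor indices just described can be sketched in a few lines.
This is only an illustration under the assumption that the formulas read as
follows: a Hoeffding bonus $\sqrt{\log(t)/(2N)}$ for UCB, a Bernstein-type bonus
with residual term $3\log(t)/N$ for UCB-V, and a bonus with the variance proxy
capped at 1/4 for UCB-Tuned. The function and argument names are hypothetical.

```python
import math

def ucb(mean, pulls, t):
    """UCB index with exploration level f(t) = log(t)."""
    return mean + math.sqrt(math.log(t) / (2.0 * pulls))

def ucb_v(mean, variance, pulls, t):
    """UCB-V index: empirical-variance term plus the residual term 3 log(t) / N."""
    return (mean
            + math.sqrt(2.0 * variance * math.log(t) / pulls)
            + 3.0 * math.log(t) / pulls)

def ucb_tuned(mean, variance, pulls, t):
    """UCB-Tuned index: the variance proxy is capped at 1/4."""
    cap = min(0.25, variance + math.sqrt(2.0 * math.log(t) / pulls))
    return mean + math.sqrt(cap * math.log(t) / pulls)

# Illustrative low-variance arm: mean 0.05, empirical variance 0.0475.
mean, variance, pulls, t = 0.05, 0.0475, 200, 10_000
print("UCB      ", ucb(mean, pulls, t))
print("UCB-V    ", ucb_v(mean, variance, pulls, t))
print("UCB-Tuned", ucb_tuned(mean, variance, pulls, t))
```

Note that when $N_a(t)$ is of order $\log(t)$ the residual term of UCB-V does not
vanish, a point that matters in the experiments discussed later.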

Different choices of the divergence function $d$ lead to different variants
of the kl-UCB algorithm, which are sometimes compared with one another
in the sequel. In order to clarify this point, we reserve the term kl-UCB
for the variant using the binary Kullback-Leibler divergence (i.e., between
Bernoulli distributions), while other choices are explicitly specified by their
denomination (e.g., kl-poisson-UCB or kl-exp-UCB for families of Poisson
or exponential distributions). The simulations presented in this section
have been performed using the pymaBandits package (Cappé, Garivier and
Kaufmann, 2012), which is publicly available from the mloss.org website
and can be used to replicate these experiments.

7.1. Bernoulli rewards. We first consider the case of Bernoulli rewards, 
which has a special historical importance and which covers several impor- 
tant practical applications of bandit algorithms (see Robbins (1952); Gittins 
(1979); and references therein). With $\{0,1\}$-valued rewards and with the
binary Kullback-Leibler divergence as a divergence function, it is readily 
checked that the kl-UCB algorithm coincides exactly with empirical KL-UCB. 



[Figure 1. Panels recovered from the extraction: UCB, UCB-Tuned, UCB-V.
Horizontal axis: Time (log scale).]



Fig 1. Regret of the various algorithms as a function of time (on a log-scale) in the
Bernoulli ten-arm scenario. On each figure, the dashed line shows the asymptotic lower
bound; the solid bold curve corresponds to the mean regret; while the dark and light shaded
regions show respectively the central 99 % region and the upper 99.95 % quantile.

In Figure 1 we consider a difficult scenario, inspired by a situation
(frequent in applications like marketing or Internet advertising) where the mean
reward of each arm is very low. In our scenario, there are ten arms: the
optimal arm has expected reward 0.1, and the nine suboptimal arms consist
of three different groups of three (stochastically) identical arms, each with
respective expected rewards 0.05, 0.02 and 0.01. We resorted to 50,000
simulations to obtain the regret plots of Figure 1. These plots show, for
each algorithm, the average cumulated regret together with quantiles of the
cumulated regret distribution as a function of time (on a logarithmic scale).

Here, there is a huge gap in performance between UCB and kl-UCB. This is 
explained by the fact that the variances of all reward distributions are much 
smaller than 1/4, the pessimistic upper bound used in Hoeffding's inequality 
(that is, in the design of UCB). The gain in performance of UCB-Tuned is not 
very significant. kl-UCB and DMED reach a performance that is on par with 
the lower bound (1) of Burnetas and Katehakis (1996) (shown in strong 
dashed line); the performance of kl-UCB is somewhat better than the one 
of DMED. Notice that for the best methods, and in particular for kl-UCB, 
the mean regret is below the lower bound, even for larger horizons, which 
reveals and illustrates the asymptotic nature of this bound. 
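
For the ten-arm scenario above, the constant in front of $\log(T)$ in the
asymptotic lower bound can be computed directly. The sketch below assumes the
usual form of that constant for Bernoulli arms, namely the sum over suboptimal
arms of $(\mu^\star - \mu_a)/d_{\mathrm{BER}}(\mu_a,\mu^\star)$, and compares it
with the leading constant $\sum_a 1/(2(\mu^\star-\mu_a))$ suggested by the
quadratic divergence; the function names are illustrative.

```python
import math

def kl_bernoulli(p, q):
    """Binary Kullback-Leibler divergence for p, q in (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def lower_bound_constant(means):
    """Sum over suboptimal arms of gap / kl(mu_a, mu_star)."""
    best = max(means)
    return sum((best - m) / kl_bernoulli(m, best) for m in means if m < best)

def quadratic_constant(means):
    """Same sum with kl replaced by its Pinsker lower bound 2 * gap^2."""
    best = max(means)
    return sum(1.0 / (2.0 * (best - m)) for m in means if m < best)

# Expected rewards of the ten-arm Bernoulli scenario.
means = [0.1] + [0.05] * 3 + [0.02] * 3 + [0.01] * 3
print("lower-bound constant:", lower_bound_constant(means))
print("quadratic constant:  ", quadratic_constant(means))
```

The ratio between the two constants gives a rough idea of how much a
Hoeffding-based index can lose when all variances are far below 1/4.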

7.2. Truncated Poisson rewards. In this second scenario, we consider 6
arms with truncated Poisson distributions. More precisely, each arm
$1 \le a \le 6$ is associated with $\nu_a$, a Poisson distribution with
expectation $(2+a)/4$, truncated at 10. The experiment consisted of 10,000
Monte-Carlo replications on a horizon of $T = 20{,}000$ steps. Note that the
truncation does not alter much the distributions here, as the probability of
draws larger than 10 is small for all arms. In fact, the role of this
truncation is only to provide an explicit upper bound on the possible rewards,
which is required for most algorithms.
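
The claim that truncation at 10 barely changes these distributions is easy to
check numerically. The following sketch sums the Poisson probabilities above
the cutoff for each of the six expectations $(2+a)/4$; the function name and
the number of summed terms are our own choices.

```python
import math

def poisson_tail(mean, cutoff, terms=60):
    """P(X > cutoff) for X ~ Poisson(mean), summing `terms` probabilities above the cutoff."""
    return sum(math.exp(-mean) * mean ** k / math.factorial(k)
               for k in range(cutoff + 1, cutoff + 1 + terms))

# Tail mass removed by truncating each arm at 10.
tails = {a: poisson_tail((2 + a) / 4.0, 10) for a in range(1, 7)}
for a, tail in tails.items():
    print(f"arm {a}: mean {(2 + a) / 4.0:.2f}, P(X > 10) = {tail:.2e}")
```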

Figure 2 shows that, in this case again, the UCB algorithm is significantly
worse than some of its competitors. The UCB-V algorithm, which appears to
have a larger regret on the first 5,000 steps, progressively improves thanks
to its use of variance estimates for the arms. But the horizon $T = 20{,}000$ is
(by far) not sufficient for UCB-V to provide an advantage over kl-UCB, which
is thus seen to offer an interesting alternative even in non-binary cases.

These three methods, however, are outperformed by the kl-poisson-UCB
algorithm: using the properties of the Poisson distributions (but not taking
truncation into account), this algorithm achieves a regret that is about ten
times smaller. In-between stands the empirical KL-UCB algorithm; it relies on
non-parametric empirical-likelihood-based upper bounds and is therefore
distribution-free, as explained in Section 6.2, yet it proves remarkably
efficient.

[Figure 2. Horizontal axis: Time (log scale).]

Fig 2. Regret of the various algorithms as a function of time in the truncated Poisson
scenario.

7.3. Truncated exponential rewards. In the third and last example, there 
are 5 arms associated with continuous distributions: the rewards are ex- 
ponential variables, with respective parameters 1/5, 1/4, 1/3, 1/2 and 1, 
truncated at Xmax = 10 (i.e., they are bounded in [0, 10]). 

In this scenario, UCB and MOSS are clearly suboptimal. This time, the
kl-UCB does not provide a significant improvement over UCB as the
expectations of the arms are not particularly close to 0 or to
$x_{\max} = 10$; hence the confidence intervals computed by kl-UCB are close
to those used by UCB. UCB-V, by estimating the variances of the distributions
of the rewards, which are much smaller than the variances of
$\{0,10\}$-valued distributions with the same expectations, would be expected
to perform significantly better. But here again, UCB-V is not competitive, at
least for a horizon $T = 20{,}000$. This can be explained by the fact that the
upper confidence bound of any suboptimal arm $a$, as stated in (15), contains a
residual term $3\log(t)/N_a(t)$; this term is negligible in common applications
of Bernstein's inequality, but it does not vanish here because $N_a(t)$ is
precisely of order $\log(t)$ (see also Garivier and Cappé 2011 for further
discussion of this issue).

The kl-exp-UCB algorithm uses the divergence $d(x,y) = x/y - 1 - \log(x/y)$
prescribed for genuine exponential distributions, but it ignores the fact that
the rewards are truncated. However, contrary to the previous scenario, the
truncation has an important effect here, as values larger than 10 are
relatively probable for each arm. Because kl-exp-UCB is not aware of the
truncation, it uses upper bounds that are slightly too large. Yet, the
performance is still excellent and stable, and the algorithm is particularly
simple.
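
A minimal sketch of the kl-exp-UCB index follows, assuming as before that the
index is the largest $y$ with $N_a(t)\,d(\hat\mu_a(t), y) \le f(t)$ and using the
divergence $d(x,y) = x/y - 1 - \log(x/y)$ quoted above. The bracket-doubling
step and the function names are illustrative choices, not the authors' code.

```python
import math

def d_exp(x, y):
    """Divergence between exponential distributions with means x and y (both > 0)."""
    return x / y - 1.0 - math.log(x / y)

def kl_exp_ucb_index(mean, pulls, level, iterations=200):
    """Largest y >= mean with pulls * d_exp(mean, y) <= level."""
    low, high = mean, 2.0 * mean
    while pulls * d_exp(mean, high) <= level:  # grow the bracket until it contains the root
        high *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (low + high)
        if pulls * d_exp(mean, mid) <= level:
            low = mid
        else:
            high = mid
    return low

# Illustrative values: empirical mean 2.0 after 20 pulls at round t = 1000.
print(kl_exp_ucb_index(2.0, 20, math.log(1000)))
```

Since the index ignores the truncation at 10, it may exceed 10 for arms with
few pulls, which is consistent with the remark that the resulting bounds are
slightly too large.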

But the best-performing algorithm in this case is the non-parametric al- 
gorithm, empirical KL-UCB. This method appears to reach here the best 
compromise between efficiency and versatility, at the price of a larger com- 
putational complexity. 





Fig 3. Regret of the various algorithms as a function of time in the truncated exponential 
scenario. 

8. Conclusion. The kl-UCB algorithm is a quasi-optimal method for 
multi-armed bandits whenever the distributions associated with the arms are 
known to belong to a simple parametric family. For each one-dimensional 
exponential family, a specific divergence function has to be used in order to 
achieve the lower bound (1) of Lai and Robbins (1985). 

However, the binary Kullback-Leibler divergence plays a special role: it 
is a conservative, universal choice for bounded distributions. The resulting 
algorithm is versatile, fast and simple, and proves to be a significant improve- 
ment, both in theory and in practice, over the widely used UCB algorithm. 

The more elaborate empirical KL-UCB algorithm relies on non-parametric
inference, by using the so-called empirical likelihood method. It is optimal if
the distributions of the arms are only known to be bounded (with a known upper
bound) and finitely supported. For general bounded arms, the
empirical-likelihood-based upper confidence bounds, which are the core of the
algorithm, still have an adequate level; but obtaining explicit finite-time
regret bounds for the algorithm itself and/or reducing its computational
complexity is still the object of further investigations (see the discussion in
Section 6.2). The simulation results show that empirical KL-UCB is efficient in
general cases when the distributions are far from being members of simple
parametric families.

In a nutshell, empirical KL-UCB is to be preferred when the distributions
of the arms are not known to belong (or be close) to a simple parametric
family and when the kl-UCB algorithm is known not to get satisfactory
performance; that is, for instance, when the variance of a $[0,1]$-valued arm
with expectation $\mu$ is much smaller than $\mu(1-\mu)$.

APPENDIX A: PROOF OF THEOREM 1

To prove Theorem 1, one needs to check the two "facts to be proven"
of Section 3.1. We will do so using the choice $\mu^\dagger = \mu^\star$ for the
analysis parameter.

A.1. Proof of Fact 1. Our goal is to upper bound the quantity

$$ \sum_{t=K}^{T-1} \mathbb{P}\bigl\{\mu^\star \ge U_{a^\star}(t)\bigr\} . $$

The control consists of two steps: first, a reduction of this question to the
application of a deviation result; second, an instantiation of the resulting
bound to our choice of $f$. The deviation inequality itself is deferred to
Section A.1.1 below.

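For all $t \in \{K, \ldots, T-1\}$, we have on the event
$\{U_{a^\star}(t) \le \mu^\star\}$ that
$\hat\mu_{a^\star}(t) \le U_{a^\star}(t) \le \mu^\star < \mu_+$. Therefore, for
all $0 < \delta < \mu_+ - U_{a^\star}(t)$,

$$ d\bigl(\hat\mu_{a^\star}(t),\, U_{a^\star}(t) + \delta\bigr) > \frac{f(t)}{N_{a^\star}(t)} . $$

Since $\hat\mu_{a^\star}(t) < \mu_+$, we then have, except when
$\hat\mu_{a^\star}(t) = U_{a^\star}(t) = \mu_-$, that $\hat\mu_{a^\star}(t)$
belongs to the open interval $I = (\mu_-,\mu_+)$; thus the discussion after
(12) on the continuity of $d$ shows that, letting $\delta \to 0$,

$$ d\bigl(\hat\mu_{a^\star}(t),\, U_{a^\star}(t)\bigr) \ge \frac{f(t)}{N_{a^\star}(t)} . $$

Therefore, since $d\bigl(\hat\mu_{a^\star}(t),\,\cdot\,\bigr)$ is non-decreasing
on $[\hat\mu_{a^\star}(t), \mu_+)$, we get the inclusion

$$ \bigl\{\mu^\star \ge U_{a^\star}(t)\bigr\} \subseteq
   \Bigl\{\mu^\star > \hat\mu_{a^\star}(t) \ \text{and}\
   d\bigl(\hat\mu_{a^\star}(t),\mu^\star\bigr) \ge \frac{f(t)}{N_{a^\star}(t)}\Bigr\} ; $$

we note that this inclusion is also valid when
$\hat\mu_{a^\star}(t) = U_{a^\star}(t) = \mu_-$. Decomposing according to the
values of $N_{a^\star}(t)$ yields

$$ \bigl\{\mu^\star \ge U_{a^\star}(t)\bigr\} \subseteq
   \bigcup_{n=1}^{t-K+1} \Bigl\{\mu^\star > \hat\mu_{a^\star,n} \ \text{and}\
   d\bigl(\hat\mu_{a^\star,n},\mu^\star\bigr) \ge \frac{f(t)}{n}\Bigr\} . $$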

By application of the deviation result (Lemma 2 below), the sum of
interest is thus bounded as

(16) $$ \sum_{t=K}^{T-1} \mathbb{P}\bigl\{\mu^\star \ge U_{a^\star}(t)\bigr\}
   \le \sum_{t=K}^{T-1} e\,\bigl\lceil f(t)\log(t)\bigr\rceil\, e^{-f(t)}
   \le 1 + \sum_{t=3}^{T-1} e\,\bigl\lceil f(t)\log(t)\bigr\rceil\, e^{-f(t)} , $$

where we used the fact that $K \ge 2$ for the second inequality. We recall that
$f(t) = \log(t) + 3\log(\log(t))$ for $t \ge 3$ and these values are indeed such
that $f(t) > 1$, as needed to apply Lemma 2. As $\log(u) \le u - 1$ for all
$u > 0$ and $\log(t) > 1$ for $t \ge 3$, we have

$$ f(t)\log(t) \le \log^2(t) + 3\bigl(\log(t) - 1\bigr)\log(t) \le 4\log^2(t) - 3 , $$

thus $e\,\lceil f(t)\log(t)\rceil \le 4e\log^2(t)$. It follows that

(17) $$ \sum_{t=3}^{T-1} e\,\bigl\lceil f(t)\log(t)\bigr\rceil\, e^{-f(t)}
   \le 4e \sum_{t=3}^{T-1} \frac{1}{t\log(t)}
   \le 4e \Bigl(\frac{1}{3\log(3)} + \int_{3}^{T-1} \frac{\mathrm{d}t}{t\log(t)}\Bigr) $$
$$ \qquad = 4e \Bigl(\frac{1}{3\log(3)} + \log(\log(T-1)) - \log(\log(3))\Bigr)
   \le 3 + 4e\log(\log(T)) , $$

where we resorted to a sum-integral comparison and used the fact that
$\log(\log(\,\cdot\,))$ is the primitive function of $t \mapsto 1/(t\log(t))$.
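
The final inequality of this step can be checked numerically. The sketch
below assumes that the quantity being bounded is the sum over
$t = 3, \ldots, T-1$ of $e\,\lceil f(t)\log(t)\rceil\,e^{-f(t)}$ with
$f(t) = \log(t) + 3\log(\log(t))$, and that the claimed bound is
$3 + 4e\log(\log(T))$; the function names are illustrative.

```python
import math

def f(t):
    """Exploration function f(t) = log(t) + 3 log(log(t)), used for t >= 3."""
    return math.log(t) + 3.0 * math.log(math.log(t))

def fact1_sum(T):
    """Sum over t = 3, ..., T-1 of e * ceil(f(t) log t) * exp(-f(t))."""
    return sum(math.e * math.ceil(f(t) * math.log(t)) * math.exp(-f(t))
               for t in range(3, T))

def fact1_bound(T):
    """Claimed upper bound 3 + 4 e log(log(T))."""
    return 3.0 + 4.0 * math.e * math.log(math.log(T))

for T in (10, 1000, 100000):
    print(T, fact1_sum(T), fact1_bound(T))
```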

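A.1.1. Deviation inequality. It remains to state and prove Lemma 2,
which is actually a consequence of the more general results provided in
Appendix C.4.

Lemma 2. For all $\epsilon > 1$, provided that $\mu_- < \mu^\star < \mu_+$,

$$ \mathbb{P}\Bigl(\bigcup_{k=1}^{n} \bigl\{\mu^\star > \hat\mu_{a^\star,k}
   \ \text{and}\ k\, d\bigl(\hat\mu_{a^\star,k},\mu^\star\bigr) \ge \epsilon\bigr\}\Bigr)
   \le e\,\bigl\lceil \epsilon\log(n)\bigr\rceil\, e^{-\epsilon} . $$
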

Proof. We consider, with the notation of Lemma 11, the random variables
$Z_k = X_{a^\star,k}$. We denote by $\theta^\star = \dot b^{-1}(\mu^\star)$ the
parameter in the exponential family corresponding to their common
distribution; $\theta^\star$ lies in the open set $\Theta$. The random variable
$e^{\lambda X_{a^\star,1}}$ is integrable for all $\lambda \in \mathbb{R}$ such
that $\theta^\star + \lambda \in \Theta$; these $\lambda$ are in an open interval
containing 0 and denoted by $(\lambda_1,\lambda_2)$. In addition, by definition
of the densities,

(18) $$ \mathbb{E}\bigl[e^{\lambda X_{a^\star,1}}\bigr]
   = \int \exp\bigl(\lambda x + \theta^\star x - b(\theta^\star)\bigr)\,\mathrm{d}\rho(x)
   = \exp\bigl(b(\theta^\star + \lambda) - b(\theta^\star)\bigr) ; $$

that is, $\phi(\lambda) = \log\mathbb{E}\bigl[e^{\lambda X_{a^\star,1}}\bigr]
= b(\theta^\star + \lambda) - b(\theta^\star)$. As indicated when introducing the
canonical exponential family, $b$ is strictly convex and (twice)
differentiable; therefore, so is $\phi$. We only need to show that
$\phi^* = d(\,\cdot\,,\mu^\star)$ at least on $[\mu_-,\mu_+]$.

Indeed, note that for all $\mu \in (\mu_-,\mu_+)$, the function
$\lambda \in (\lambda_1,\lambda_2) \mapsto \lambda\mu - \phi(\lambda)$ is strictly
concave and twice differentiable, with derivative equal to
$\mu - \dot b(\theta^\star + \lambda)$; this derivative is null at a unique point
$\lambda_\mu$ given by

(19) $$ \theta^\star + \lambda_\mu = \dot b^{-1}(\mu) $$

and therefore, the concave function of interest is maximized at this point,
with value

(20) $$ \phi^*(\mu) = \lambda_\mu\,\mu - \phi(\lambda_\mu)
   = \bigl(\dot b^{-1}(\mu) - \theta^\star\bigr)\mu - \bigl(b(\theta^\star + \lambda_\mu) - b(\theta^\star)\bigr) $$
$$ \qquad = \bigl(\dot b^{-1}(\mu) - \dot b^{-1}(\mu^\star)\bigr)\mu
   - b\bigl(\dot b^{-1}(\mu)\bigr) + b\bigl(\dot b^{-1}(\mu^\star)\bigr)
   = d(\mu,\mu^\star) , $$

where the final equality follows from (11). For the other values of $\mu$,
namely, $\mu = \mu_-$ and $\mu = \mu_+$, we argue by continuity, as $d$ was
extended by continuity and since $\phi^*$ is convex thus continuous on
$\mathbb{R}$. □
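
The identity between the convex conjugate of the log-moment-generating
function and the divergence can be illustrated in the Bernoulli case. The
sketch below assumes the canonical parametrization $b(\theta) = \log(1 + e^\theta)$,
for which the inverse of $\dot b$ is the logit function, and evaluates
$\lambda\mu - \phi(\lambda)$ at the maximizer given by the first-order condition;
the function names are illustrative.

```python
import math

def kl_bernoulli(p, q):
    """Binary Kullback-Leibler divergence for p, q in (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def log_mgf(lam, mu_star):
    """phi(lam) = log E[exp(lam X)] for X ~ Bernoulli(mu_star)."""
    return math.log(1.0 - mu_star + mu_star * math.exp(lam))

def logit(p):
    """Inverse of the mean map of the Bernoulli family, b(theta) = log(1 + e^theta)."""
    return math.log(p / (1.0 - p))

def conjugate_at(mu, mu_star):
    """Value of lam * mu - phi(lam) at the stationary point lam = logit(mu) - logit(mu_star)."""
    lam_mu = logit(mu) - logit(mu_star)
    return lam_mu * mu - log_mgf(lam_mu, mu_star)

for mu, mu_star in [(0.2, 0.5), (0.7, 0.1), (0.05, 0.9)]:
    print(mu, mu_star, conjugate_at(mu, mu_star), kl_bernoulli(mu, mu_star))
```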

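A.2. Proof of Fact 2. Our goal is to upper bound the quantity

$$ \sum_{n \ge n_0+1} \mathbb{P}\bigl\{\hat\nu_{a,n} \in \mathcal{C}_{\mu^\star,\,f(T)/n}\bigr\} . $$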

The control consists of four main steps. Some rewriting of the events of 
interest is first performed, to get a form that is suitable for an application 
of a Markov-Chernoff bounding (which is the second step). In the third 
step the obtained bound is further bounded in an integral form using the 
intuition of Laplace's method. This integral bound is finally controlled in an 
explicit way, using an auxiliary result proved in Section A. 2.1. 

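Rewriting step. Note first that in our case

(21) $$ \mathcal{K}_{\inf}(\nu_a,\mu^\star) = \inf\bigl\{d(\mu_a,\mu) : \mu > \mu^\star\bigr\} = d(\mu_a,\mu^\star) , $$

where the second equality is because the mapping $d(\mu_a,\,\cdot\,)$ is strictly
convex and continuous over $I$ and achieves its minimum at $\mu_a$, thus is
increasing on $[\mu_a,\mu_+)$. Similarly, for all $\gamma > 0$,

$$ \mathcal{C}_{\mu^\star,\gamma} = \bigl\{\nu \in \mathfrak{M}_1(\bar I) :
   \exists\,\mu \in (\mu^\star,\mu_+) \ \text{with}\ d\bigl(\mathrm{E}(\nu),\mu\bigr) \le \gamma\bigr\} . $$

Distributions $\nu \in \mathfrak{M}_1(\bar I)$ with $\mathrm{E}(\nu) > \mu^\star$
all belong to $\mathcal{C}_{\mu^\star,\gamma}$; for distributions
$\nu \in \mathfrak{M}_1(\bar I)$ with $\mathrm{E}(\nu) \le \mu^\star$, it follows
from the same arguments as above that $\nu \in \mathcal{C}_{\mu^\star,\gamma}$ if
and only if $\mathrm{E}(\nu) = \mu^\star$ or
$d\bigl(\mathrm{E}(\nu),\mu^\star\bigr) < \gamma$. For
$0 \le \gamma \le d(\mu_a,\mu^\star)$, by continuity and strict convexity of
$d(\,\cdot\,,\mu^\star)$ on $I$ there exists a unique
$\mu^\star_\gamma \in [\mu_a,\mu^\star]$ such that

$$ d\bigl(\mu^\star_\gamma,\mu^\star\bigr) = \gamma ; $$

distributions with $d\bigl(\mathrm{E}(\nu),\mu^\star\bigr) \le \gamma$ together
with $\mathrm{E}(\nu) \le \mu^\star$ are then exactly the ones with
$\mathrm{E}(\nu) \ge \mu^\star_\gamma$. All in all, we proved that the sets of
interest can be rewritten, for all $0 \le \gamma \le d(\mu_a,\mu^\star)$,

$$ \mathcal{C}_{\mu^\star,\gamma} = \bigl\{\nu \in \mathfrak{M}_1(\bar I) : \mathrm{E}(\nu) \ge \mu^\star_\gamma\bigr\} . $$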

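Markov-Chernoff bounding. Now, for $n \ge n_0 + 1$, by definition of $n_0$, we
have $f(T)/n < d(\mu_a,\mu^\star)$; the probabilities of interest hence equal

$$ \mathbb{P}\bigl\{\hat\nu_{a,n} \in \mathcal{C}_{\mu^\star,f(T)/n}\bigr\}
   = \mathbb{P}\bigl\{\hat\mu_{a,n} \ge \mu^\star_{f(T)/n}\bigr\}
   \le e^{-\lambda\mu^\star_{f(T)/n}}\,\mathbb{E}\bigl[e^{\lambda\hat\mu_{a,n}}\bigr]
   = e^{-\lambda\mu^\star_{f(T)/n}}\,\Bigl(\mathbb{E}\bigl[e^{(\lambda/n)X_{a,1}}\bigr]\Bigr)^{n} , $$

where the upper bound holds for all $\lambda > 0$ such that
$e^{(\lambda/n)X_{a,1}}$ is integrable and comes from a Markov-Chernoff bounding,
while the last equality is by independence and identical distribution. Denoting
$\lambda' = \lambda/n$, we have proved that for all $\lambda' > 0$ such that
$e^{\lambda' X_{a,1}}$ is integrable,

$$ \mathbb{P}\bigl\{\hat\nu_{a,n} \in \mathcal{C}_{\mu^\star,f(T)/n}\bigr\}
   \le \exp\Bigl(-n\bigl(\lambda'\mu^\star_{f(T)/n} - \phi_a(\lambda')\bigr)\Bigr) , $$

where $\phi_a(\lambda') = \log\mathbb{E}\bigl[e^{\lambda' X_{a,1}}\bigr]$. Now, the
same argument as in the proof of Lemma 2 shows that
$\lambda' \mapsto \lambda'\mu^\star_{f(T)/n} - \phi_a(\lambda')$ is defined on an
open interval of $\mathbb{R}$ containing 0, is maximized at the value
$\lambda' > 0$ such that $\mu^\star_{f(T)/n} = \dot b(\theta_a + \lambda')$, where
$\theta_a$ is the parameter in $\Theta$ corresponding to $\nu_a$, with maximal
value equal to $d\bigl(\mu^\star_{f(T)/n},\mu_a\bigr)$. Therefore,

$$ \mathbb{P}\bigl\{\hat\nu_{a,n} \in \mathcal{C}_{\mu^\star,f(T)/n}\bigr\}
   \le \exp\Bigl(-n\, d\bigl(\mu^\star_{f(T)/n},\mu_a\bigr)\Bigr)
   = e^{-f(T)\,\psi(n/f(T))} , $$

where we introduced the mapping

$$ \psi : x \in \Bigl[\frac{1}{d(\mu_a,\mu^\star)},\,+\infty\Bigr) \longmapsto x\, d\bigl(\mu^\star_{1/x},\mu_a\bigr) . $$

Integral bound (Laplace's method). Since the two mappings

$$ \gamma \in \bigl[0, d(\mu_a,\mu^\star)\bigr] \mapsto \mu^\star_\gamma
   \quad\text{and}\quad \mu \in [\mu_a,\mu^\star] \mapsto d(\mu,\mu_a) $$

are respectively decreasing and increasing, their composition is a decreasing
mapping. Hence, the mapping $\psi$ is nonnegative and increasing; as a
consequence, the sum of interest can be bounded as

(22) $$ \sum_{n \ge n_0+1} \mathbb{P}\bigl\{\hat\nu_{a,n} \in \mathcal{C}_{\mu^\star,f(T)/n}\bigr\}
   \le \sum_{n \ge n_0+1} e^{-f(T)\,\psi(n/f(T))}
   \le \int_{n_0}^{\infty} e^{-f(T)\,\psi(x/f(T))}\,\mathrm{d}x $$
$$ \qquad = f(T)\int_{n_0/f(T)}^{\infty} e^{-f(T)\,\psi(u)}\,\mathrm{d}u
   \le f(T)\int_{1/d(\mu_a,\mu^\star)}^{\infty} e^{-f(T)\,\psi(u)}\,\mathrm{d}u , $$

where the last equality follows by a change of variable and the last inequality
holds because $n_0/f(T) \ge 1/d(\mu_a,\mu^\star)$ by definition of $n_0$. An
equivalent of the last bound in (22) can be obtained by using the standard
Laplace's method; it is of the order of $\sqrt{f(T)}$. A non-asymptotic upper
bound is now obtained via an explicit lower bound on $\psi$.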

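Control of the integral bound. We first prove that

(23) $$ \forall\,\gamma \in \bigl[0, d(\mu_a,\mu^\star)\bigr], \qquad
   \mu^\star_\gamma - \mu_a \ge \frac{d(\mu_a,\mu^\star) - \gamma}{-d'(\mu_a,\mu^\star)} \ge 0 , $$

where we denoted by $d'(\mu_a,\mu^\star)$ the derivative at $\mu_a$ of the
mapping $\mu \mapsto d(\mu,\mu^\star)$; the latter exists in view of the defining
expression (11) of $d$ and is negative, as $d(\,\cdot\,,\mu^\star)$ is decreasing
on $(\mu_-,\mu^\star]$. Indeed, $d(\,\cdot\,,\mu^\star)$ is above any tangent line
as it is a convex function:

$$ \forall\,\mu \in (\mu_-,\mu^\star], \qquad
   d(\mu_a,\mu^\star) - d(\mu,\mu^\star) \le d'(\mu_a,\mu^\star)\,(\mu_a - \mu) , $$

which in particular entails that

$$ \forall\,\gamma \in \bigl[0, d(\mu_a,\mu^\star)\bigr], \qquad
   d(\mu_a,\mu^\star) - \gamma \le d'(\mu_a,\mu^\star)\,\bigl(\mu_a - \mu^\star_\gamma\bigr) , $$

leading to the claimed inequality as we recall that $d'(\mu_a,\mu^\star) < 0$.
Combining Lemma 3 below with (23), we get that

$$ d\bigl(\mu^\star_\gamma,\mu_a\bigr) \ge \frac{\bigl(\mu^\star_\gamma - \mu_a\bigr)^2}{2\sigma^2_{a,\star}}
   \ge \frac{\bigl(d(\mu_a,\mu^\star) - \gamma\bigr)^2}{2\sigma^2_{a,\star}\bigl(d'(\mu_a,\mu^\star)\bigr)^2} , $$

where we recall the definition
$\sigma^2_{a,\star} = \max\bigl\{\mathrm{Var}(\nu_\theta) : \mu_a \le \mathrm{E}(\nu_\theta) \le \mu^\star\bigr\}$.
Thus,

$$ \forall\,x \in \Bigl[\frac{1}{d(\mu_a,\mu^\star)},\,+\infty\Bigr), \qquad
   \psi(x) \ge \frac{x\,\bigl(d(\mu_a,\mu^\star) - 1/x\bigr)^2}{2\sigma^2_{a,\star}\bigl(d'(\mu_a,\mu^\star)\bigr)^2} . $$

Bounding either $x$ or $1/x$ in the expression above, the integral on the
right-hand side of (22) may be further upper bounded as

$$ \int_{1/d(\mu_a,\mu^\star)}^{\infty} e^{-f(T)\,\psi(u)}\,\mathrm{d}u
   \le \int_{1/d(\mu_a,\mu^\star)}^{2/d(\mu_a,\mu^\star)} e^{-(d(\mu_a,\mu^\star) - 1/u)^2\, D_{a,\star}\, f(T)}\,\mathrm{d}u
   + \int_{2/d(\mu_a,\mu^\star)}^{\infty} \exp\Bigl(-\frac{u\, f(T)\,\bigl(d(\mu_a,\mu^\star)\bigr)^2}{8\sigma^2_{a,\star}\bigl(d'(\mu_a,\mu^\star)\bigr)^2}\Bigr)\,\mathrm{d}u $$
$$ \qquad \le \int_{1/d(\mu_a,\mu^\star)}^{2/d(\mu_a,\mu^\star)} e^{-(d(\mu_a,\mu^\star) - 1/u)^2\, D_{a,\star}\, f(T)}\,\mathrm{d}u
   + \frac{8\sigma^2_{a,\star}\bigl(d'(\mu_a,\mu^\star)\bigr)^2}{f(T)\,\bigl(d(\mu_a,\mu^\star)\bigr)^2} , $$

where the upper bound on the second integral is by a direct calculation and
where the constant $D_{a,\star}$ equals

$$ D_{a,\star} = \frac{1}{2\sigma^2_{a,\star}\, d(\mu_a,\mu^\star)\,\bigl(d'(\mu_a,\mu^\star)\bigr)^2} . $$

The remaining integral is controlled by performing a change of variable,
$v = d(\mu_a,\mu^\star) - 1/u$, that is,
$\mathrm{d}u = \bigl(d(\mu_a,\mu^\star) - v\bigr)^{-2}\,\mathrm{d}v$. Thus,

$$ \int_{1/d(\mu_a,\mu^\star)}^{2/d(\mu_a,\mu^\star)} e^{-(d(\mu_a,\mu^\star) - 1/u)^2\, D_{a,\star}\, f(T)}\,\mathrm{d}u
   = \int_{0}^{d(\mu_a,\mu^\star)/2} \frac{e^{-v^2 D_{a,\star} f(T)}}{\bigl(d(\mu_a,\mu^\star) - v\bigr)^2}\,\mathrm{d}v $$
$$ \qquad \le \frac{4}{\bigl(d(\mu_a,\mu^\star)\bigr)^2}\int_{0}^{\infty} e^{-v^2 D_{a,\star} f(T)}\,\mathrm{d}v
   = \frac{4}{\bigl(d(\mu_a,\mu^\star)\bigr)^2}\,\frac{\sqrt{\pi}}{2\sqrt{D_{a,\star}\, f(T)}} . $$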

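Putting everything together, we obtain the bound

(24) $$ \sum_{n \ge n_0+1} \mathbb{P}\bigl\{\hat\nu_{a,n} \in \mathcal{C}_{\mu^\star,f(T)/n}\bigr\}
   \le 2\sqrt{\frac{2\pi\sigma^2_{a,\star}\bigl(d'(\mu_a,\mu^\star)\bigr)^2}{\bigl(d(\mu_a,\mu^\star)\bigr)^3}}\,\sqrt{f(T)}
   + 8\sigma^2_{a,\star}\Bigl(\frac{d'(\mu_a,\mu^\star)}{d(\mu_a,\mu^\star)}\Bigr)^2 . $$

A.2.1. A variant of Pinsker's inequality. It only remains to state and
prove Lemma 3. Note that in the case of Bernoulli distributions, it
corresponds to (a refinement of) Pinsker's inequality.

Lemma 3. Let $\mu_1 < \mu_2$ be two elements of $I$. Then

$$ d(\mu_2,\mu_1) \ge \frac{(\mu_2 - \mu_1)^2}{2\sigma^2} , $$

where $\sigma^2 = \max\bigl\{\mathrm{Var}(\nu_\theta) : \mathrm{E}(\nu_\theta) \in [\mu_1,\mu_2]\bigr\}$.

The proof of this inequality shows that, symmetrically, the inequality
$d(\mu_1,\mu_2) \ge (\mu_2 - \mu_1)^2/(2\sigma^2)$ is also true.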


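Proof. Denote

$$ \phi(\lambda) = \log\Bigl(\int \exp(\lambda x)\,\mathrm{d}\nu_{\dot b^{-1}(\mu_1)}(x)\Bigr) $$

and recall that by (20), $\phi$ is twice differentiable. For
$\mu \in [\mu_1,\mu_2]$, Equation (20) states that $\phi^*(\mu) = d(\mu,\mu_1)$.
According to Crouzeix (1977), $\phi^*$ is twice differentiable, with second
derivative equal to

$$ (\phi^*)''(\mu) = \frac{1}{\phi''(\lambda_\mu)} , $$

where $\lambda_\mu = \dot b^{-1}(\mu) - \dot b^{-1}(\mu_1)$ by Equation (19).
From (18) and the general links between $b$ and variances in exponential
families,

$$ \phi''(\lambda) = \ddot b\bigl(\dot b^{-1}(\mu_1) + \lambda\bigr)
   = \mathrm{Var}\bigl(\nu_{\dot b^{-1}(\mu_1) + \lambda}\bigr) . $$

Hence, by definition of $\sigma^2$,

$$ (\phi^*)''(\mu) = \frac{1}{\mathrm{Var}\bigl(\nu_{\dot b^{-1}(\mu)}\bigr)} \ge \frac{1}{\sigma^2} . $$

Thus, by Taylor's formula and as $\phi^*(\mu_1) = (\phi^*)'(\mu_1) = 0$ because
$d(\,\cdot\,,\mu_1)$ is minimal at $\mu_1$, we obtain

$$ d(\mu_2,\mu_1) = \phi^*(\mu_2) \ge \frac{(\mu_2 - \mu_1)^2}{2\sigma^2} . $$

□

A.3. Proof of Theorem 1. Combining (10) with the proofs of Fact 1
and of Fact 2 (respectively, equations (16)-(17) and (24)), substituting the
expression for $f$, and using the fact that
$\mathcal{K}_{\inf}(\nu_a,\mu^\star) = d(\mu_a,\mu^\star)$ as asserted in (21), we
get that for $T \ge 3$,

$$ \mathbb{E}[N_a(T)] \le \cdots
   + 2\sqrt{\frac{2\pi\sigma^2_{a,\star}\bigl(d'(\mu_a,\mu^\star)\bigr)^2}{\bigl(d(\mu_a,\mu^\star)\bigr)^3}}\,
     \sqrt{\log(T) + 3\log(\log(T))} + \cdots $$

Rearranging the terms concludes the proof.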

APPENDIX B: PROOF OF THEOREM 2 

To prove Theorem 2, we consider an analysis parameter
$\mu^\dagger = \mu^\star - \epsilon$ and check the two "facts to be proven"
discussed in Section 3.1. We however first require an important result which
provides an alternative variational expression for $\mathcal{K}_{\inf}$.

B.1. Variational form of $\mathcal{K}_{\inf}$. A key element is that
$\mathcal{K}_{\inf}$ defined in (2) with $\mathcal{D} = \mathcal{F}$ may be given
the following variational expression; see Borwein and Lewis (1991);
Harari-Kermadec (2006) as well as the re-derivation of this result by Honda and
Takemura (2011, Theorem 3). The notation $\mathbb{E}_\nu$ here indicates that the
random variable $X$ has distribution $\nu$.

Lemma 4. For all $\nu \in \mathcal{F}$ and all $\mu \in (0,1)$,

$$ \mathcal{K}_{\inf}(\nu,\mu) = \max_{\lambda \in [0,1]} \mathbb{E}_\nu\bigl[h_{\lambda,\mu}(X)\bigr] , $$

where $h_{\lambda,\mu}$ is the mapping

$$ h_{\lambda,\mu} : x \in [0,1] \longmapsto \log\Bigl(1 - \lambda\,\frac{x - \mu}{1 - \mu}\Bigr) . $$

The following regularity lemma will be used throughout this section; it
corresponds to Honda and Takemura (2011, Lemma 6).

Lemma 5. For all $\nu \in \mathcal{F}$, all $\mu \in (0,1)$, and all
$0 < \epsilon < \mu$,

$$ \mathcal{K}_{\inf}(\nu,\mu) \le \mathcal{K}_{\inf}(\nu,\mu - \epsilon) + \frac{\epsilon}{1 - \mu} , $$

and, under the additional condition that $\mathrm{E}(\nu) < \mu - \epsilon$,

$$ \mathcal{K}_{\inf}(\nu,\mu) \ge \mathcal{K}_{\inf}(\nu,\mu - \epsilon) + \frac{\epsilon^2}{2} . $$

Lemma 5 in particular implies that for a distribution $\nu$ with finite
support, $\mathcal{K}_{\inf}(\nu,\mu) > 0$ if and only if $\mathrm{E}(\nu) < \mu$.

Indeed, in view of the original expression of $\mathcal{K}_{\inf}$ in (2), and
by continuity of $\mathrm{KL}(\nu,\,\cdot\,)$ on the set of distributions with
same support as $\nu$, we have, for all $\mu \in (0,1)$, that
$\mathcal{K}_{\inf}(\nu,\mu) = 0$ as soon as $\mathrm{E}(\nu) \ge \mu$. Now, the
second part of Lemma 5 entails that $\mathcal{K}_{\inf}(\nu,\mu) > 0$ when
$\mathrm{E}(\nu) < \mu$.

B.2. Proof of Fact 1. Our goal is to control the sum 

T-l 



Y,^f^*-e>Ua*{t)}. 



t=K 



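First notice that the inequality $\mu^\star - \varepsilon \ge U_{a^\star}(t)$ means that all $\nu \in \mathcal{F}$ with
$\mathrm{E}(\nu) > \mu^\star - \varepsilon$ are such that $\mathrm{KL}(\hat\nu_{a^\star}(t),\nu) > f(t)/N_{a^\star}(t)$; in particular, it
thus implies that

\[
\mathcal{K}_{\inf}\bigl(\hat\nu_{a^\star}(t),\,\mu^\star-\varepsilon\bigr)
= \inf\bigl\{\mathrm{KL}(\hat\nu_{a^\star}(t),\nu) : \nu \in \mathcal{F} \text{ and } \mathrm{E}(\nu) > \mu^\star-\varepsilon\bigr\}
\ge \frac{f(t)}{N_{a^\star}(t)}.
\]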

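The sum of interest is thus bounded as follows,

\[
\sum_{t=K}^{T-1} \mathbb{P}\bigl\{\mu^\star-\varepsilon \ge U_{a^\star}(t)\bigr\}
\le \sum_{t=K}^{T-1} \mathbb{P}\Bigl\{\mathcal{K}_{\inf}\bigl(\hat\nu_{a^\star}(t),\mu^\star-\varepsilon\bigr) \ge \frac{f(t)}{N_{a^\star}(t)}\Bigr\}
\le \sum_{t=K}^{T-1} \mathbb{P}\Bigl\{\mathcal{K}_{\inf}\bigl(\hat\nu_{a^\star}(t),\mu^\star\bigr) \ge \frac{f(t)}{N_{a^\star}(t)} + \frac{\varepsilon^2}{2}\Bigr\},
\]

where the second inequality is obtained by application of the second part of
Lemma 5, which is legitimate as $\mathcal{K}_{\inf}(\hat\nu_{a^\star}(t),\mu^\star-\varepsilon) > 0$ entails
$\mathrm{E}(\hat\nu_{a^\star}(t)) < \mu^\star-\varepsilon$, as recalled after the statement of Lemma 5.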

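By decomposing according to the values of $N_{a^\star}(t)$ as in (8), we have the
inclusion

\[
\Bigl\{\mathcal{K}_{\inf}\bigl(\hat\nu_{a^\star}(t),\mu^\star\bigr) \ge \frac{f(t)}{N_{a^\star}(t)} + \frac{\varepsilon^2}{2}\Bigr\}
\subseteq \bigcup_{n=1}^{t-K+1} \Bigl\{\mathcal{K}_{\inf}\bigl(\hat\nu_{a^\star,n},\mu^\star\bigr) \ge \frac{f(t)}{n} + \frac{\varepsilon^2}{2}\Bigr\},
\]

and the probability of each event in the union is upper bounded, resorting
to Lemma 6 below, by

\[
\mathbb{P}\Bigl\{\mathcal{K}_{\inf}\bigl(\hat\nu_{a^\star,n},\mu^\star\bigr) \ge \frac{f(t)}{n} + \frac{\varepsilon^2}{2}\Bigr\}
\le e(n+2)\exp\bigl(-n(\varepsilon^2/2 + f(t)/n)\bigr)
= e^{-f(t)}\,(n+2)\,e^{1-n\varepsilon^2/2}.
\]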

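The union bound then leads to

\[
\sum_{t=K}^{T-1} \mathbb{P}\bigl\{\mu^\star-\varepsilon \ge U_{a^\star}(t)\bigr\}
\le \sum_{t=K}^{T-1}\sum_{n=1}^{t-K+1} e^{-f(t)}\,(n+2)\,e^{1-n\varepsilon^2/2}
= e \sum_{t=K}^{T-1} e^{-f(t)} \Biggl(\sum_{n=1}^{t-K+1} (n+2)\,e^{-n\varepsilon^2/2}\Biggr).
\]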

It only remains to deal with the term 

t-K+l oo 
n=l n=2 

The positive mapping $n \mapsto n\,e^{-n\varepsilon^2/2}$ is increasing on $[0, 2/\varepsilon^2]$ and decreasing
on $[2/\varepsilon^2, +\infty)$, so that, for $\varepsilon \le 1$, the following series can be bounded by
integrals.



n=2 n=2 n=r2/e2]+l 

/■[2/e2] ^00 

A=2 Ja:=r2/£21 



00 

2 , 



00 



(where we performed the change of variable $u = x\varepsilon^2/2$ to obtain the last
equality). Putting everything together, we have the following bound that
completes the proof of Fact 1:



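\[
\sum_{t=K}^{T-1} \mathbb{P}\bigl\{\mu^\star-\varepsilon \ge U_{a^\star}(t)\bigr\}
\le \Biggl(\sum_{t=K}^{T-1} e^{-f(t)}\Biggr)\Bigl(3e + 2 + \frac{4}{\varepsilon^2} + \frac{8e}{\varepsilon^4}\Bigr).
\tag{25}
\]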
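The constant in (25) can be checked numerically. The sketch below assumes that the inner sum is extended to all $n \ge 1$, so that the quantity to control is $e\sum_{n\ge1}(n+2)e^{-n\varepsilon^2/2}$, which has a closed form as a combination of geometric series; the function names are hypothetical.

```python
import math

def fact1_series(eps):
    """e * sum_{n>=1} (n + 2) * exp(-n * eps^2 / 2), using closed-form geometric sums."""
    r = math.exp(-eps ** 2 / 2.0)
    # sum_{n>=1} n r^n = r / (1 - r)^2 and sum_{n>=1} r^n = r / (1 - r)
    return math.e * (r / (1.0 - r) ** 2 + 2.0 * r / (1.0 - r))

def fact1_constant(eps):
    """Right-hand factor of (25): 3e + 2 + 4/eps^2 + 8e/eps^4."""
    return 3.0 * math.e + 2.0 + 4.0 / eps ** 2 + 8.0 * math.e / eps ** 4

if __name__ == "__main__":
    for eps in (0.05, 0.1, 0.5, 1.0):
        print(eps, fact1_series(eps), fact1_constant(eps))
```

For the values of $\varepsilon \le 1$ tried here, the series stays below the constant, roughly by a factor of two when $\varepsilon$ is small.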
B.2.1. Deviation inequality.

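Lemma 6. Let $\nu \in \mathcal{F}$ be a distribution with expectation $\mathrm{E}(\nu) = \mu \in (0,1)$
and denote by $Z_1,\dots,Z_n$ an $n$-sample of random variables with common
distribution $\nu$. For all $\varepsilon > 0$ and $n \ge 1$,

\[
\mathbb{P}\bigl\{\mathcal{K}_{\inf}(\hat\nu_n,\mu) \ge \varepsilon\bigr\} \le e(n+2)\,e^{-n\varepsilon},
\qquad \text{where} \quad \hat\nu_n = \frac{1}{n}\sum_{k=1}^n \delta_{Z_k}.
\]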
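As an illustration of Lemma 6, the deviation probability can be computed exactly when $\nu$ is a Bernoulli distribution, since $\hat\nu_n$ is then determined by the number of ones in the sample and $\mathcal{K}_{\inf}(\hat\nu_n,\mu)$ equals the Bernoulli divergence $\mathrm{kl}(\hat p,\mu)$ when $\hat p < \mu$ and 0 otherwise. The code below is a sketch under that Bernoulli assumption, with hypothetical function names.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < q < 1."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total

def exact_deviation_probability(n, mu, eps):
    """P{K_inf(nu_hat_n, mu) >= eps} for an n-sample drawn from Bernoulli(mu)."""
    prob = 0.0
    for k in range(n + 1):
        p_hat = k / n
        # K_inf is positive only when the empirical mean is below mu.
        if p_hat < mu and bernoulli_kl(p_hat, mu) >= eps:
            prob += math.comb(n, k) * mu ** k * (1.0 - mu) ** (n - k)
    return prob

def lemma6_bound(n, eps):
    """Upper bound e (n + 2) exp(-n eps) of Lemma 6."""
    return math.e * (n + 2) * math.exp(-n * eps)
```

Comparing `exact_deviation_probability(n, mu, eps)` with `lemma6_bound(n, eps)` for a few values of `n` and `eps` shows how much room the polynomial factor $e(n+2)$ leaves in this simple case.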

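Proof. This result is inspired by Honda and Takemura (2012, Theorem 11)
and borrows some elements of its proof, in particular the fact that,
for all $\gamma > 0$, there exists a set $\Lambda_\gamma \subset [0,1]$ with cardinality at most $2 + 1/\gamma$
such that

\[
\mathcal{K}_{\inf}(\hat\nu_n,\mu) \le \gamma + \max_{\lambda\in\Lambda_\gamma} \frac{1}{n}\sum_{k=1}^n \log\Bigl(1-\lambda\,\frac{Z_k-\mu}{1-\mu}\Bigr).
\tag{26}
\]

Assuming for the moment the existence of such a grid $\Lambda_\gamma$ (which will be
re-proved below), we obtain using the union bound

\[
\mathbb{P}\bigl\{\mathcal{K}_{\inf}(\hat\nu_n,\mu) \ge \varepsilon\bigr\}
\le \sum_{\lambda\in\Lambda_\gamma} \mathbb{P}\Biggl\{\frac{1}{n}\sum_{k=1}^n \log\Bigl(1-\lambda\,\frac{Z_k-\mu}{1-\mu}\Bigr) \ge \varepsilon-\gamma\Biggr\}.
\tag{27}
\]

By the Markov-Chernoff inequality, for all $\lambda \in [0,1]$,

\[
\mathbb{P}\Biggl\{\frac{1}{n}\sum_{k=1}^n \log\Bigl(1-\lambda\,\frac{Z_k-\mu}{1-\mu}\Bigr) \ge \varepsilon-\gamma\Biggr\}
\le e^{-n(\varepsilon-\gamma)}\,\mathbb{E}\Biggl[\prod_{k=1}^n \Bigl(1-\lambda\,\frac{Z_k-\mu}{1-\mu}\Bigr)\Biggr]
= e^{-n(\varepsilon-\gamma)} \prod_{k=1}^n \mathbb{E}\Bigl[1-\lambda\,\frac{Z_k-\mu}{1-\mu}\Bigr]
= e^{-n(\varepsilon-\gamma)},
\]

using both the independence of the $Z_k$ and the fact that $\mathrm{E}(\nu) = \mu$ for the
final equalities. The bound (27) and the observation that $\Lambda_\gamma$ has cardinality
at most $2 + 1/\gamma$ yield

\[
\mathbb{P}\bigl\{\mathcal{K}_{\inf}(\hat\nu_n,\mu) \ge \varepsilon\bigr\} \le \Bigl(2+\frac{1}{\gamma}\Bigr)\,e^{-n(\varepsilon-\gamma)}.
\]

We conclude by taking $\gamma = 1/n$.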

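It only remains to prove (26). Thanks to Lemma 4, this inequality can be
rewritten as

\[
\sup_{\lambda\in[0,1]} \frac{1}{n}\sum_{k=1}^n \log\Bigl(1-\lambda\,\frac{Z_k-\mu}{1-\mu}\Bigr)
\le \gamma + \max_{\lambda'\in\Lambda_\gamma} \frac{1}{n}\sum_{k=1}^n \log\Bigl(1-\lambda'\,\frac{Z_k-\mu}{1-\mu}\Bigr),
\]

and thus follows from the fact that for all $\lambda \in [0,1]$, there exists $\lambda' \in \Lambda_\gamma$
such that, for all $x \in [0,1]$,

\[
\log\Bigl(1-\lambda\,\frac{x-\mu}{1-\mu}\Bigr) \le \gamma + \log\Bigl(1-\lambda'\,\frac{x-\mu}{1-\mu}\Bigr).
\]

This fact is a consequence of Lemma 7 below, by choosing the set

\[
\Lambda_\gamma = \bigl\{1/2,\,1\bigr\}
\cup \bigl\{1/2+\gamma,\ \dots,\ 1/2+\lfloor 1/(2\gamma)\rfloor\gamma\bigr\}
\cup \bigl\{1/2-\gamma,\ \dots,\ 1/2-\lfloor 1/(2\gamma)\rfloor\gamma\bigr\},
\]

which has, at most, $2 + 1/\gamma$ elements (Lemma 7 applies for $\lambda \in [0,1)$ and
$\lambda = 1$ belongs to the grid $\Lambda_\gamma$). □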
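The grid $\Lambda_\gamma$ is easy to enumerate explicitly, which makes the cardinality claim concrete. The sketch below uses exact rational arithmetic to avoid floating-point issues in the floor; the function name `lambda_grid` is an assumption made for illustration.

```python
from fractions import Fraction
import math

def lambda_grid(gamma):
    """Return the sorted grid {1/2, 1} U {1/2 +/- j*gamma : 1 <= j <= floor(1/(2*gamma))}."""
    gamma = Fraction(gamma)
    m = math.floor(1 / (2 * gamma))
    half = Fraction(1, 2)
    grid = {half, Fraction(1)}
    for j in range(1, m + 1):
        grid.add(half + j * gamma)
        grid.add(half - j * gamma)
    return sorted(grid)

if __name__ == "__main__":
    grid = lambda_grid(Fraction(1, 10))
    # Number of grid points versus the stated bound 2 + 1/gamma
    print(len(grid), 2 + 10)
```

Using a set means that a point such as $1/2 + \lfloor 1/(2\gamma)\rfloor\gamma = 1$ is not counted twice.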

Lemma 7. For all $\lambda, \lambda' \in [0,1)$ such that either $\lambda \le \lambda' \le 1/2$ or
$\lambda \ge \lambda' \ge 1/2$, for all real numbers $c \le 1$,

\[
\log(1-\lambda c) \le \log(1-\lambda' c) + 2\,|\lambda'-\lambda|.
\]

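Proof. First note that the quantities $1-\lambda c$ and $1-\lambda' c$ are indeed positive
as they are respectively larger than $1-\lambda > 0$ and $1-\lambda' > 0$; this is where the
condition $\lambda, \lambda' < 1$ plays a role. The claimed inequality is straightforward
in the case $c \in [0,1]$ and $\lambda \ge \lambda' \ge 1/2$, as well as in the case $c \le 0$ and
$\lambda \le \lambda' \le 1/2$. In the rest of the proof we consider only the other cases.

The mapping $\psi_c : \lambda \in [0,1) \mapsto \log(1-\lambda c)$ is concave, differentiable, with
a non-increasing and continuous derivative $\psi_c'(\lambda) = -c/(1-\lambda c)$, therefore
$\psi_c'(u) \ge \psi_c'(1/2) \ge \psi_c'(v)$ for all $0 \le u \le 1/2$ and $1/2 \le v < 1$. This entails,
via Taylor's equality with integral remainder, that

\[
\log(1-\lambda c) - \log(1-\lambda' c) = \int_{\lambda'}^{\lambda} \psi_c'(x)\,dx
= \begin{cases}
\displaystyle \int_{\lambda'}^{\lambda} \psi_c'(x)\,dx \le \psi_c'(1/2)\,(\lambda-\lambda') & \text{if } \lambda \ge \lambda' \ge 1/2 \text{ and } c \le 0; \\[2ex]
\displaystyle \int_{\lambda}^{\lambda'} \bigl(-\psi_c'(x)\bigr)\,dx \le -\psi_c'(1/2)\,(\lambda'-\lambda) & \text{if } \lambda \le \lambda' \le 1/2 \text{ and } c \in [0,1].
\end{cases}
\]

In the first case, we note that $-1/c \ge 0$ and thus

\[
\psi_c'(1/2) = \frac{1}{1/2 - 1/c} \le 2,
\]

while in the second case, $-1/c \le -1$ and thus

\[
-\psi_c'(1/2) = -\frac{1}{1/2-1/c} \le -\frac{1}{1/2-1} = 2;
\]

in all cases, the bound $2\,|\lambda'-\lambda|$ holds. □

B.3. Proof of Fact 2. In this section we upper bound the residual
sum $\sum_{n \ge n_0+1} \mathbb{P}\bigl\{\hat\nu_{a,n} \in \mathcal{C}_{\mu^\star-\varepsilon,\,f(T)/n}\bigr\}$, where $n_0$ has been chosen in (9) as

\[
n_0 = \Bigl\lceil \frac{f(T)}{\mathcal{K}_{\inf}(\nu_a,\mu^\star)} \Bigr\rceil.
\]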

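Main argument. For all $\gamma > 0$, we have the following inclusions

\[
\mathcal{C}_{\mu^\star-\varepsilon,\gamma}
\subseteq \bigl\{\nu \in \mathcal{F} : \mathcal{K}_{\inf}(\nu,\mu^\star-\varepsilon) \le \gamma\bigr\}
\subseteq \Bigl\{\nu \in \mathcal{F} : \mathcal{K}_{\inf}(\nu,\mu^\star) \le \gamma + \frac{\varepsilon}{1-\mu^\star}\Bigr\},
\]

where we have used successively (7) and the first statement of Lemma 5 (for
$0 < \varepsilon < \mu^\star$). Hence,

\[
\mathbb{P}\bigl\{\hat\nu_{a,n} \in \mathcal{C}_{\mu^\star-\varepsilon,\,f(T)/n}\bigr\}
\le \mathbb{P}\Bigl\{\mathcal{K}_{\inf}(\hat\nu_{a,n},\mu^\star) \le \frac{f(T)}{n} + \frac{\varepsilon}{1-\mu^\star}\Bigr\}.
\tag{28}
\]

The right-hand side is the probability that $\hat\nu_{a,n}$ belongs to the set

\[
\Bigl\{\mathcal{K}_{\inf}(\,\cdot\,,\mu^\star) \le \frac{f(T)}{n} + \frac{\varepsilon}{1-\mu^\star}\Bigr\} \cap \mathfrak{M}_1\bigl(\mathrm{Supp}(\nu_a)\bigr).
\]

Lemma 8 below asserts that this set is closed and convex and thus that
Sanov's inequality (Lemma 10) may be applied to upper bound the right-hand
side of (28) as

\[
\mathbb{P}\Bigl\{\mathcal{K}_{\inf}(\hat\nu_{a,n},\mu^\star) \le \frac{f(T)}{n} + \frac{\varepsilon}{1-\mu^\star}\Bigr\}
\le \exp\Bigl(-n\,K_a\Bigl(\frac{f(T)}{n} + \frac{\varepsilon}{1-\mu^\star}\Bigr)\Bigr),
\]

where

\[
K_a(\gamma) = \inf\bigl\{\mathrm{KL}(\nu,\nu_a) : \nu \in \mathfrak{M}_1\bigl(\mathrm{Supp}(\nu_a)\bigr) \text{ such that } \mathcal{K}_{\inf}(\nu,\mu^\star) \le \gamma\bigr\}.
\tag{29}
\]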

Fig 4. Quantities appearing in the proof represented in the two-dimensional probability
simplex $\mathfrak{M}_1(\{0, 1/2, 1\})$; distributions with the same expectation lie on a common vertical
line. Arrows stand for KL-divergence. The distribution $\tilde\nu$ in the figure reaches the minimum
in the definition of $K_a(\gamma)$ given by Equation (29).

Figure 4 displays pictorially the connection between the quantities $\nu_a$, $\mu^\star$,
$\mathcal{K}_{\inf}(\nu_a,\mu^\star)$, $\mathcal{C}_{\mu^\star,\gamma}$ and $K_a(\gamma)$. Defining the index $n_1 \ge n_0$ such that

\[
n_1 = \Bigl\lceil \frac{(1+\varepsilon_a)\,f(T)}{\mathcal{K}_{\inf}(\nu_a,\mu^\star)} \Bigr\rceil,
\]

where $\varepsilon_a > 0$ is an analysis parameter, we have

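\[
\begin{aligned}
\sum_{n \ge n_0+1} \mathbb{P}\Bigl\{\mathcal{K}_{\inf}(\hat\nu_{a,n},\mu^\star) \le \frac{f(T)}{n} + \frac{\varepsilon}{1-\mu^\star}\Bigr\}
&\le n_1 - n_0 - 1 + \sum_{n \ge n_1} \exp\Bigl(-n\,K_a\Bigl(\frac{f(T)}{n} + \frac{\varepsilon}{1-\mu^\star}\Bigr)\Bigr) \\
&\le \frac{\varepsilon_a f(T)}{\mathcal{K}_{\inf}(\nu_a,\mu^\star)} + \sum_{n \ge n_1} \exp\Bigl(-n\,K_a\bigl(\gamma_a(\varepsilon_a,\varepsilon)\bigr)\Bigr),
\qquad \gamma_a(\varepsilon_a,\varepsilon) \stackrel{\mathrm{def}}{=} \frac{\mathcal{K}_{\inf}(\nu_a,\mu^\star)}{1+\varepsilon_a} + \frac{\varepsilon}{1-\mu^\star},
\end{aligned}
\tag{30}
\]

where we used, for the second inequality, the fact that $\gamma \mapsto K_a(\gamma)$ is
non-increasing and where we introduced for convenience the short-hand notation
$\gamma_a(\varepsilon_a,\varepsilon)$. We show below that there exists a constant $M(\nu_a,\mu^\star) > 0$ only
depending on $\nu_a$ and $\mu^\star$ such that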

(31) V. ^ .a^i„f(^a,^1 , ^aila{ea,e)) ^ ^^^^J^^^^^ • 

Recall that $\mathcal{K}_{\inf}(\nu_a,\mu^\star) > 0$ as $\mathrm{E}(\nu_a) < \mu^\star$; therefore, after substitution
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in (30), and under the condition e ^ (1 — /i*) /Cmf (z^a, Ai*)/2, one obtains 

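\[
\sum_{n \ge n_0+1} \mathbb{P}\bigl\{\hat\nu_{a,n} \in \mathcal{C}_{\mu^\star-\varepsilon,\,f(T)/n}\bigr\}
\le \frac{\varepsilon_a f(T)}{\mathcal{K}_{\inf}(\nu_a,\mu^\star)}
+ \frac{1}{1-\exp\bigl(-\varepsilon_a^2\,\mathcal{K}_{\inf}(\nu_a,\mu^\star)^2/M(\nu_a,\mu^\star)\bigr)},
\tag{32}
\]

which suffices to prove Fact 2 (in Section B.4 we provide a simpler upper
bound of the right-hand side which coincides with the form given in the
statement of Theorem 2).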

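Bounding $K_a$. It remains to show that (31) holds. First note that the
definition of $K_a$ in (29) implies that the infimum is achieved: because of the
finiteness of the support of $\nu_a$, the function $\mathrm{KL}(\,\cdot\,,\nu_a)$ is continuous over the
compact set $\mathfrak{M}_1(\mathrm{Supp}(\nu_a))$, and its infimum is taken over a closed thus
compact set (see Lemma 8). Thus, there exists an element $\tilde\nu$ of $\mathfrak{M}_1(\mathrm{Supp}(\nu_a))$
such that $K_a(\gamma_a(\varepsilon_a,\varepsilon)) = \mathrm{KL}(\tilde\nu,\nu_a)$ and $\mathcal{K}_{\inf}(\tilde\nu,\mu^\star) \le \gamma_a(\varepsilon_a,\varepsilon)$. Note that
$\tilde\nu$ depends on the two analysis parameters $\varepsilon$ and $\varepsilon_a$. By Pinsker's inequality,

\[
\mathrm{KL}(\tilde\nu,\nu_a) \ge \frac{\|\tilde\nu-\nu_a\|_1^2}{2},
\qquad \text{where} \quad \|\tilde\nu-\nu_a\|_1 = \sum_{x \in \mathrm{Supp}(\nu_a)} \bigl|\tilde\nu(\{x\}) - \nu_a(\{x\})\bigr|.
\]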

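To obtain (31) we show below the existence of a constant $C(\nu_a,\mu^\star) > 0$ only
depending on $\nu_a$ and $\mu^\star$ such that $\varepsilon_a\,\mathcal{K}_{\inf}(\nu_a,\mu^\star) \le 2\,C(\nu_a,\mu^\star)\,\|\tilde\nu-\nu_a\|_1$
for all relevant values of $\varepsilon_a$ and $\varepsilon$; then, by Pinsker's inequality, (31) holds
with $M(\nu_a,\mu^\star)$ equal to $8\,C(\nu_a,\mu^\star)^2$.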
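Because the supremum is achieved in the alternative expression of $\mathcal{K}_{\inf}$
provided by Lemma 4, there exists $\lambda_a \in [0,1]$ such that

\[
\mathcal{K}_{\inf}(\nu_a,\mu^\star) = \sum_{x \in \mathrm{Supp}(\nu_a)} \nu_a(\{x\})\,\log\Bigl(1-\lambda_a\,\frac{x-\mu^\star}{1-\mu^\star}\Bigr).
\]

Note that $\lambda_a$ only depends on $\nu_a$ and $\mu^\star$. There are two cases: either $\nu_a(\{1\}) > 0$
and then necessarily $\lambda_a < 1$, or $1 \notin \mathrm{Supp}(\nu_a)$. Using again Lemma 4 to
lower bound $\mathcal{K}_{\inf}(\tilde\nu,\mu^\star)$, we have

\[
\mathcal{K}_{\inf}(\nu_a,\mu^\star) - \mathcal{K}_{\inf}(\tilde\nu,\mu^\star)
\le \sum_{x \in \mathrm{Supp}(\nu_a)} \bigl(\nu_a(\{x\}) - \tilde\nu(\{x\})\bigr)\,\log\Bigl(1-\lambda_a\,\frac{x-\mu^\star}{1-\mu^\star}\Bigr).
\tag{33}
\]

Introducing

\[
C(\nu_a,\mu^\star) = \max_{x \in \mathrm{Supp}(\nu_a)} \Bigl|\log\Bigl(1-\lambda_a\,\frac{x-\mu^\star}{1-\mu^\star}\Bigr)\Bigr|
\]

and bounding each side of (33) yields $\mathcal{K}_{\inf}(\nu_a,\mu^\star) - \gamma_a(\varepsilon_a,\varepsilon) \le C(\nu_a,\mu^\star)\,\|\tilde\nu-\nu_a\|_1$.
In the two cases mentioned above, either because $\lambda_a < 1$ or because
all $x$ in the support of $\nu_a$ are such that $x < 1$, the quantity $C(\nu_a,\mu^\star)$ is
finite. The proof of (31) is concluded by noting that when $\varepsilon$ and $\varepsilon_a$ satisfy
the condition in (31), we have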
, . /Cinf(z/a,^*) , £ l+£a/2 , 

and thus, 

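B.3.1. On the level sets of $\mathcal{K}_{\inf}(\,\cdot\,,\mu^\star)$. In the proof of Fact 2, we used
the following lemma, which we now prove.

Lemma 8. For all $\mu^\star \in (0,1)$, the function $\mathcal{K}_{\inf}(\,\cdot\,,\mu^\star)$ is convex and
continuous over $\mathfrak{M}_1(\mathrm{Supp}(\nu_a))$. In particular, the sets $\{\mathcal{K}_{\inf}(\,\cdot\,,\mu^\star) \le \gamma\} \cap
\mathfrak{M}_1(\mathrm{Supp}(\nu_a))$ are closed convex subsets of $\mathfrak{M}_1(\mathrm{Supp}(\nu_a))$, for all $\gamma > 0$.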

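Proof. We first show that $\mathcal{K}_{\inf}(\,\cdot\,,\mu^\star)$ is a convex function. Fix $\alpha \in [0,1]$,
two distributions $\nu_1, \nu_2 \in \mathcal{F}$, and consider two other distributions $\nu_1', \nu_2' \in \mathcal{F}$
with $\mathrm{E}(\nu_1') > \mu^\star$ and $\mathrm{E}(\nu_2') > \mu^\star$. Then

\[
\mathcal{K}_{\inf}\bigl(\alpha\nu_1 + (1-\alpha)\nu_2,\ \mu^\star\bigr)
\le \mathrm{KL}\bigl(\alpha\nu_1 + (1-\alpha)\nu_2,\ \alpha\nu_1' + (1-\alpha)\nu_2'\bigr)
\le \alpha\,\mathrm{KL}(\nu_1,\nu_1') + (1-\alpha)\,\mathrm{KL}(\nu_2,\nu_2'),
\]

where the first inequality is by definition of $\mathcal{K}_{\inf}$ and the fact that
$\alpha\nu_1' + (1-\alpha)\nu_2'$ still has an expectation larger than $\mu^\star$, and where the second inequality
follows from the joint convexity of the Kullback-Leibler divergence (see, e.g.,
Cover and Thomas 1991, Theorem 2.7.2). By taking infima over the possible
$\nu_1'$ and $\nu_2'$, we get

\[
\mathcal{K}_{\inf}\bigl(\alpha\nu_1 + (1-\alpha)\nu_2,\ \mu^\star\bigr) \le \alpha\,\mathcal{K}_{\inf}(\nu_1,\mu^\star) + (1-\alpha)\,\mathcal{K}_{\inf}(\nu_2,\mu^\star),
\]

showing that the mapping $\nu \in \mathcal{F} \mapsto \mathcal{K}_{\inf}(\nu,\mu^\star)$ is convex.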



C(fa, //*) = max 

a;gSupp{!^a) 
We now turn to the continuity of $\mathcal{K}_{\inf}(\,\cdot\,,\mu^\star)$. We first show that it is
bounded; indeed, $\mathcal{K}_{\inf}(\nu,\mu^\star) = 0$ for all $\nu \in \mathfrak{M}_1(\mathrm{Supp}(\nu_a))$ with $\mathrm{E}(\nu) \ge \mu^\star$,
while for all $\nu \in \mathfrak{M}_1(\mathrm{Supp}(\nu_a))$ with $\mathrm{E}(\nu) < \mu^\star$,



a;GSupp(i') 



(l_^,.)/(l_E(^))..({x}) 



, l-E(z^) , 1 
^ log- ^^log- 



I- IJL* ° I- fl* ' 

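where the first inequality holds by convexity of the Kullback-Leibler
divergence. The function $\mathcal{K}_{\inf}(\,\cdot\,,\mu^\star)$ is therefore a bounded and convex
function defined over the simplex $\mathfrak{M}_1(\mathrm{Supp}(\nu_a))$; consequently, it is upper
semi-continuous (see Rockafellar 1970, Theorem 10.2). It thus suffices to show that
$\mathcal{K}_{\inf}(\,\cdot\,,\mu^\star)$ is lower semi-continuous over $\mathfrak{M}_1(\mathrm{Supp}(\nu_a))$. Using the notation
of Lemma 4, for all $\nu \in \mathfrak{M}_1(\mathrm{Supp}(\nu_a))$, the mapping

\[
\lambda \in [0,1] \longmapsto \mathbb{E}_\nu\bigl[h_{\lambda,\mu^\star}(X)\bigr] = \sum_{x \in \mathrm{Supp}(\nu_a)} \nu(\{x\})\,\log\Bigl(1-\lambda\,\frac{x-\mu^\star}{1-\mu^\star}\Bigr)
\]

is continuous (with $-\infty$ as a possible value at $\lambda = 1$). The result of Lemma 4
can thus be rewritten as indicating that for all $\nu \in \mathfrak{M}_1(\mathrm{Supp}(\nu_a))$,

\[
\mathcal{K}_{\inf}(\nu,\mu^\star) = \sup_{\lambda \in [0,1)} \mathbb{E}_\nu\bigl[h_{\lambda,\mu^\star}(X)\bigr].
\]

Now, for each $\lambda \in [0,1)$, but not necessarily for $\lambda = 1$, the mapping $\nu \in
\mathfrak{M}_1(\mathrm{Supp}(\nu_a)) \mapsto \mathbb{E}_\nu[h_{\lambda,\mu^\star}(X)]$ is continuous. The supremum of continuous
functions being lower semi-continuous, this concludes the proof of the lower
semi-continuity of $\mathcal{K}_{\inf}(\,\cdot\,,\mu^\star)$. □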

B.4. Proof of Theorem 2. Combining (10) together with the proofs of 
Fact 1 and of Fact 2 — more precisely, with the upper bounds (25) and (32) — , 
we obtain that under the conditions e < /i* and e ^ (1— /i*)/(2ea/Cinf(fa,/i*)), 
the expected number of times the suboptimal arm a is pulled satisfies 

(34) + ^"/^^^ , + 
We further upper bound the right-hand side of (34) to simplify the form of
the bound. First, as $\varepsilon < \mu^\star < 1$,

\[
3e + 2 + \frac{4}{\varepsilon^2} + \frac{8e}{\varepsilon^4} \le \frac{3e + 2 + 4 + 8e}{\varepsilon^4} \le \frac{36}{\varepsilon^4}.
\]

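Second, because of the choice $f(t) = \log(t) + \log(\log(t))$ for $t \ge 2$,

\[
\sum_{t=K}^{T-1} e^{-f(t)} \le \frac{1}{2\log(2)} + \sum_{t=3}^{T-1} \frac{1}{t\log(t)}
\le \frac{1}{2\log(2)} - \log(\log(2)) + \log(\log(T-1)) \le 2 + \log(\log(T)),
\]

where we used the same arguments as in (17). Third, we have

\[
\frac{1}{1-\exp\bigl(-\varepsilon_a^2\,\mathcal{K}_{\inf}(\nu_a,\mu^\star)^2/M(\nu_a,\mu^\star)\bigr)}
\le 2 + \frac{2\,M(\nu_a,\mu^\star)}{\varepsilon_a^2\,\mathcal{K}_{\inf}(\nu_a,\mu^\star)^2},
\]

as a consequence of the bound $1/(1-e^{-x}) \le 2 + 2/x$, valid for all $x > 0$.
The latter is obtained by distinguishing two cases: if $x \ge 1$, then $1-e^{-x} \ge
1-e^{-1} > 1/2$; for $0 < x \le 1$, we have $e^{-x} \le 1-x+x^2/2$ and thus

\[
\frac{1}{1-e^{-x}} \le \frac{1}{x-x^2/2} = \frac{1}{x(1-x/2)} \le \frac{2}{x}.
\]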
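Both elementary inequalities used above are easy to probe numerically. The following sketch, with hypothetical function names, evaluates the sum $\sum_{t=K}^{T-1} e^{-f(t)} = \sum_{t=K}^{T-1} 1/(t\log t)$ for $K = 2$ and the factor $1/(1-e^{-x})$, so that they can be compared with $2+\log(\log(T))$ and $2+2/x$ respectively.

```python
import math

def sum_exp_minus_f(K, T):
    """sum_{t=K}^{T-1} exp(-f(t)) for f(t) = log t + log log t, i.e. sum of 1/(t log t)."""
    return sum(1.0 / (t * math.log(t)) for t in range(K, T))

def geometric_factor(x):
    """1 / (1 - exp(-x)) for x > 0, computed with expm1 for accuracy at small x."""
    return -1.0 / math.expm1(-x)

if __name__ == "__main__":
    for T in (3, 10, 1000):
        print(T, sum_exp_minus_f(2, T), 2.0 + math.log(math.log(T)))
    for x in (1e-3, 0.1, 1.0, 5.0):
        print(x, geometric_factor(x), 2.0 + 2.0 / x)
```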
Putting these three upper bounds together, (34) implies that 

(35) E[Na{T)] ^ +^(2 + log(log(T) 

This bound could be optimized over the admissible values of the analysis 
parameters e and e^. For the sake of readability however we only provide 
convenient values that balance the (orders of magnitude of the) two main 
terms of the bound, that is, the second and third terms on the right-hand 
side of (35). These values are e = //* (log(T)) and Sa such that e ^ 
(1 — fi*)/ (2ea/Cinf (^ai M*)) ; because T ^ 3, the condition e < fi* is satisfied. 
Substituting these values, we get 

2/.-/(r) (log(r))-^/^ (i-M-)2M(z..,^-) ..^.^2/5 ^ , 
+ (i-^^)K;,„f(..,/.^)^^ 2{,^? 
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Replacing f{T) with its value log(T) -|- log(log(T)) and bounding the quan- 
tity log (log(T)) (log(T)) by 1 concludes the proof. 

APPENDIX C: MISCELLANEOUS RESULTS 

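C.1. Maximizing the expectation under KL constraint. In this
section, we provide an algorithm to compute

\[
U(\nu,\gamma) = \sup\bigl\{\mathrm{E}(\nu') : \nu' \in \mathcal{F} \text{ and } \mathrm{KL}(\nu,\nu') \le \gamma\bigr\},
\tag{36}
\]

for $\gamma > 0$ and $\nu \in \mathcal{F}$.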

C.1.1. Reduction to a finite-dimensional convex program. We first show
that in (36) $\mathcal{F}$ can be replaced by $\mathfrak{M}_1(\mathrm{Supp}(\nu) \cup \{1\})$ without altering the
value of $U(\nu,\gamma)$. To do so, we prove below the equality

\[
\sup\bigl\{\mathrm{E}(\nu') : \nu' \in \mathfrak{M}_1([0,1]) \text{ and } \mathrm{KL}(\nu,\nu') \le \gamma\bigr\}
= \sup\bigl\{\mathrm{E}(\nu') : \nu' \in \mathfrak{M}_1(\mathrm{Supp}(\nu) \cup \{1\}) \text{ and } \mathrm{KL}(\nu,\nu') \le \gamma\bigr\},
\]

which implies, by a sandwich argument, that both of these quantities are
also equal to $U(\nu,\gamma)$.

We first establish that $\sup\{\mathrm{E}(\nu') : \nu' \in \mathfrak{M}_1([0,1]) \text{ and } \mathrm{KL}(\nu,\nu') \le \gamma\}$ is
achieved for some $\nu^\star \in \mathfrak{M}_1([0,1])$. We then prove that $\nu^\star$ has a support
included in $\mathrm{Supp}(\nu) \cup \{1\}$.

We equip the set $\mathfrak{M}_1([0,1])$ with the vague topology, i.e., the minimal
topology such that, for all continuous functions $f : [0,1] \to \mathbb{R}$, the mappings
$M_f : \nu' \in \mathfrak{M}_1([0,1]) \mapsto M_f(\nu') = \mathbb{E}_{\nu'}[f(X)]$ are continuous. Prokhorov's
theorem indicates that $\mathfrak{M}_1([0,1])$ is then a metrizable and compact space.
Since the mapping $\nu' \in \mathfrak{M}_1([0,1]) \mapsto \mathrm{KL}(\nu,\nu')$ is lower semi-continuous
(see, e.g., Chaganty and Karandikar 1996), $\{\nu' \in \mathfrak{M}_1([0,1]) : \mathrm{KL}(\nu,\nu') \le \gamma\}$
is a closed and thus compact subset of $\mathfrak{M}_1([0,1])$. Its image by the
continuous mapping $M_{\mathrm{Id}}$, where $\mathrm{Id} : x \in [0,1] \mapsto x$, is therefore a compact subset of
$[0,1]$. Thus, the supremum of $\{\mathrm{E}(\nu') : \nu' \in \mathfrak{M}_1([0,1]) \text{ and } \mathrm{KL}(\nu,\nu') \le \gamma\}$
is achieved, at a distribution $\nu^\star$.

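Consider now the Lebesgue decomposition of $\nu^\star$,

\[
\nu^\star = \lambda\,\nu^\star_{\mathrm{ac}} + (1-\lambda)\,\nu^\star_{\mathrm{sing}},
\]

where $\lambda \in [0,1]$ and where $\nu^\star_{\mathrm{ac}}$ is a probability measure that is absolutely
continuous with respect to $\nu$, i.e., that has support included in $\mathrm{Supp}(\nu)$,
while $\nu^\star_{\mathrm{sing}}$ is a probability measure that is singular with respect to $\nu$, i.e.,
$\nu^\star_{\mathrm{sing}}(\mathrm{Supp}(\nu)) = 0$. Defining the probability measure $\tilde\nu = \lambda\,\nu^\star_{\mathrm{ac}} + (1-\lambda)\,\delta_1$
and using the short-hand notations $\nu_x$, $\tilde\nu_x$, $\nu^\star_x$ and $\nu^\star_{\mathrm{ac},x}$ instead of $\nu(\{x\})$,
$\tilde\nu(\{x\})$, $\nu^\star(\{x\})$ and $\nu^\star_{\mathrm{ac}}(\{x\})$, respectively, we have

\[
\mathrm{KL}(\nu,\nu^\star) - \mathrm{KL}(\nu,\tilde\nu)
= \sum_{x \in \mathrm{Supp}(\nu)} \nu_x \log\frac{\tilde\nu_x}{\nu^\star_x}
= \begin{cases}
0 & \text{if } 1 \notin \mathrm{Supp}(\nu); \\[1ex]
\displaystyle \nu_1 \log\frac{\lambda\,\nu^\star_{\mathrm{ac},1} + 1 - \lambda}{\lambda\,\nu^\star_{\mathrm{ac},1}} & \text{if } 1 \in \mathrm{Supp}(\nu).
\end{cases}
\]

In all cases, $\mathrm{KL}(\nu,\tilde\nu) \le \mathrm{KL}(\nu,\nu^\star) \le \gamma$. Therefore, in view of the maximality
of $\mathrm{E}(\nu^\star)$ under the latter constraint,

\[
0 \le \mathrm{E}(\nu^\star) - \mathrm{E}(\tilde\nu) = (1-\lambda) \int_{[0,1]} (x-1)\,d\nu^\star_{\mathrm{sing}}(x),
\]

and thus, either $\lambda = 1$ (and therefore, $\nu^\star$ has a support included in the one
of $\nu$), or $\nu^\star_{\mathrm{sing}} = \delta_1$, which corresponds to the case where $\nu^\star$ has support
included in $\mathrm{Supp}(\nu) \cup \{1\}$.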

C.1.2. Algorithm for computing $U(\nu,\gamma)$. Because of the reformulation
$U(\nu,\gamma) = \sup\{\mathrm{E}(\nu') : \nu' \in \mathfrak{M}_1(\mathrm{Supp}(\nu) \cup \{1\}) \text{ and } \mathrm{KL}(\nu,\nu') \le \gamma\}$, where
we recall that $\nu$ has a finite support, $U(\nu,\gamma)$ appears as the value of a convex
program which we restate under the following simpler form.

Let $n$ be a positive integer and fix $n$ elements of $[0,1]$, the largest of them
being equal to 1, denoted by $0 \le x_1 < \dots < x_{n-1} < x_n = 1$. Probability
measures over this set $\{x_1,\dots,x_n\}$ are identified with $n$-tuples $(q_1,\dots,q_n)$
such that $q_i \ge 0$ for all $i$ and $q_1 + \dots + q_n = 1$. A probability distribution
$(p_1,\dots,p_n)$ is given and the optimization problem at hand is to

\[
\text{maximize } \sum_{i=1}^n q_i x_i \quad \text{under the constraints} \quad
\begin{cases}
\forall i \in \{1,\dots,n\},\ q_i \ge 0; \\
q_1 + \dots + q_n = 1; \\
\displaystyle \sum_{i=1}^n p_i \log\frac{p_i}{q_i} \le \gamma.
\end{cases}
\tag{37}
\]

In Nilim and El Ghaoui (2005), a similar problem arises in a different
context, and a somewhat different solution than the one exposed below is
proposed for the case when the $p_i$ are all positive (see also Filippi, Cappé
and Garivier 2010). However, note that in the case of the computation of
$U(\nu,\gamma)$, the identification of $\mathrm{Supp}(\nu) \cup \{1\}$ to $\{x_1,\dots,x_n\}$ is such that $p_i > 0$
for $1 \le i \le n-1$ (a condition assumed to be satisfied in the rest of this
section); but it can happen that $p_n = 0$, this is actually the case if and only if
$1 \notin \mathrm{Supp}(\nu)$. The optimization problem is trivial if $n = 2$ and $p_n = 0$; we
thus assume in the sequel that either $n \ge 3$ or $p_n > 0$, and in both cases,
two components at least of $(p_1,\dots,p_n)$ are positive. The solution $(q_1^\star,\dots,q_n^\star)$
of (37) may be computed numerically by the following algorithm.

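Algorithm 4: Maximization of the expectation under KL constraint.

Parameters: A set $X = \{x_1,\dots,x_n\}$ with $0 \le x_1 < \dots < x_n = 1$; a probability
distribution $(p_1,\dots,p_n)$ on $X$ with $p_i > 0$ for $1 \le i \le n-1$; a level $\gamma > 0$

Definitions: Let $a = 1$ when $p_n > 0$ and $a = x_{n-1}$ when $p_n = 0$; consider the mapping

\[
g : \ell \in (a,\infty) \longmapsto g(\ell) = \sum_{i=1}^n p_i \log(\ell - x_i) + \log\Biggl(\sum_{i=1}^n \frac{p_i}{\ell - x_i}\Biggr)
\tag{38}
\]

if $p_n = 0$ and $g(1) < \gamma$ then
    Let $r = \exp(g(1) - \gamma)$
    for $i = 1$ to $n-1$ do
        $q_i^\star = r\,\dfrac{p_i/(1-x_i)}{\sum_{j=1}^{n-1} p_j/(1-x_j)}$
    Let $q_n^\star = 1 - r$
else
    Find the root $\ell$ of the equation $g(\ell) = \gamma$
    for $i = 1$ to $n$ do
        $q_i^\star = \dfrac{p_i/(\ell-x_i)}{\sum_{j=1}^{n} p_j/(\ell-x_j)}$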


Lemma 9 entails that the equation $g(\ell) = \gamma$ admits a unique solution that
can be computed using any numerical method (Newton's search or even
simple dichotomy) as $g$ is a convex decreasing function from $(a,\infty)$ onto
$(0,\infty)$, where $a = 1$ or $a = x_{n-1}$ as specified in Algorithm 4. In order to
upper bound or to provide an approximate value of the root of the equation,
one can use the following Taylor series approximation of the function $g$
(detailed calculations are omitted) as $\ell \to +\infty$: $g(\ell) = \sigma^2(p)/(2\ell^2) + o(\ell^{-2})$,
where $\sigma^2(p) = \sum_{i=1}^n p_i x_i^2 - \bigl(\sum_{i=1}^n p_i x_i\bigr)^2$ is the variance of $(p_1,\dots,p_n)$ on $\{x_1,\dots,x_n\}$.
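The steps of Algorithm 4 can be sketched in a few lines. This is an illustrative implementation rather than the authors' code: the function names, the bisection on $[1, \infty)$ with a doubling search for the upper bracket, and the tolerance `tol` are assumptions. It relies on the facts stated in this section that $g$ is decreasing and that, outside the first branch, the root satisfies $\ell \ge 1$.

```python
import math

def g(ell, xs, ps):
    """Mapping (38); terms with p_i = 0 are dropped since they contribute nothing."""
    pairs = [(x, p) for x, p in zip(xs, ps) if p > 0.0]
    return (sum(p * math.log(ell - x) for x, p in pairs)
            + math.log(sum(p / (ell - x) for x, p in pairs)))

def max_expectation_under_kl(xs, ps, gamma, tol=1e-12):
    """Return (q, U): the maximizer q of sum_i q_i x_i under KL(p, q) <= gamma, and its value.

    Assumes xs is increasing with xs[-1] == 1, ps[i] > 0 for all i < n - 1, gamma > 0.
    """
    n = len(xs)
    if ps[-1] == 0.0 and g(1.0, xs, ps) < gamma:
        # Branch with mass moved to the point 1, which carries no mass under p.
        r = math.exp(g(1.0, xs, ps) - gamma)
        weights = [ps[i] / (1.0 - xs[i]) for i in range(n - 1)]
        total = sum(weights)
        q = [r * w / total for w in weights] + [1.0 - r]
    else:
        # Bisection for the root of g(ell) = gamma; g is decreasing and the root is >= 1.
        lo, hi = 1.0, 2.0
        while g(hi, xs, ps) > gamma:
            hi *= 2.0
        while hi - lo > tol * hi:
            mid = 0.5 * (lo + hi)
            if g(mid, xs, ps) > gamma:
                lo = mid
            else:
                hi = mid
        ell = hi  # hi > 1 always, so ell - x_i > 0 for every i
        weights = [p / (ell - x) for x, p in zip(xs, ps)]
        total = sum(weights)
        q = [w / total for w in weights]
    return q, sum(qi * x for qi, x in zip(q, xs))
```

In both branches the returned distribution saturates the constraint, that is, $\sum_i p_i \log(p_i/q_i^\star) = \gamma$ up to numerical error, which gives a simple way to validate an implementation.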

To derive Algorithm 4, observe that (37) is a linear program under convex
constraints with Lagrangian

\[
\mathcal{L}(q_1,\dots,q_n,\ell,\ell',\ell_1,\dots,\ell_n)
= \sum_{i=1}^n q_i x_i - \ell'\Biggl(\sum_{i=1}^n p_i \log\frac{p_i}{q_i} - \gamma\Biggr) - \ell\Biggl(\sum_{i=1}^n q_i - 1\Biggr) + \sum_{i=1}^n \ell_i q_i;
\tag{39}
\]

since we proved in Section C.1.1 the existence of a solution, we know that
it is characterized by the Karush-Kuhn-Tucker conditions.
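More precisely, the argument $(q_1^\star,\dots,q_n^\star)$ solving (37) must first satisfy
the KL constraint in (37), which implies that $q_i^\star > 0$ as soon as $p_i > 0$,
that is, at least for all $i \le n-1$. Also, from (39), there exist real numbers
$\ell, \ell', \ell_1,\dots,\ell_n$ with

\[
\forall i = 1,\dots,n \ \text{s.t.}\ p_i > 0, \qquad 0 = x_i + \ell'\,\frac{p_i}{q_i^\star} + \ell_i - \ell;
\tag{40}
\]
\[
\text{if } p_n = 0, \qquad 0 = x_n + \ell_n - \ell;
\tag{41}
\]
\[
\forall i = 1,\dots,n, \qquad 0 \le \ell_i;
\]
\[
0 = \ell'\Biggl(\sum_{i=1}^n p_i \log\frac{p_i}{q_i^\star} - \gamma\Biggr);
\tag{42}
\]
\[
\forall i = 1,\dots,n, \qquad 0 = \ell_i\,q_i^\star.
\tag{43}
\]

Note that (40) and subsequent equations show that $\ell \ge 0$ as well. Also, for
all $i$ such that $q_i^\star > 0$, that is, at least for $i \le n-1$, we get from (43) that
$\ell_i = 0$, which, after substitution in (40) and provided that $\ell' > 0$, leads to:

\[
\text{for all } i \text{ with } p_i > 0 \text{ and } q_i^\star > 0, \qquad \ell > x_i \quad \text{and} \quad q_i^\star = \ell'\,\frac{p_i}{\ell - x_i}.
\tag{44}
\]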

First case: $p_n > 0$. As (40) is valid in this case for at least two $i$ with
$\ell_i = 0$, we necessarily have $\ell' \ne 0$. Because (44) is then valid for all $i \le n$,
we have $\ell > x_n = 1$ and by summation $\ell' \sum_{i=1}^n p_i/(\ell-x_i) = \sum_{i=1}^n q_i^\star = 1$
and thus $\ell' = \bigl(\sum_{i=1}^n p_i/(\ell-x_i)\bigr)^{-1}$. Hence,

\[
q_i^\star = \frac{p_i/(\ell-x_i)}{\sum_{j=1}^n p_j/(\ell-x_j)}.
\]

The parameter $\ell$ can be characterized as follows: substituting the above
expression into (42) yields

\[
\gamma = \sum_{i=1}^n p_i \log(\ell-x_i) + \log\Biggl(\sum_{i=1}^n \frac{p_i}{\ell-x_i}\Biggr) = g(\ell),
\]

where $g$ is defined in (38).

Second case: $p_n = 0$ and $g(1) \ge \gamma$. Lemma 9 below shows that $g$ is in this
case a continuous decreasing mapping from $(x_{n-1},+\infty)$ onto $(0,+\infty)$; there
thus exists a unique $\ell \in [1,+\infty)$ such that $g(\ell) = \gamma$; in particular, $\ell > x_i$ for
all $i \le n-1$. It can be checked directly that the distribution

\[
q_n^\star = 0 \quad \text{and} \quad q_i^\star = \frac{p_i/(\ell-x_i)}{\sum_{j=1}^{n-1} p_j/(\ell-x_j)} \quad \text{for } i \le n-1,
\]

as well as the Lagrange multipliers $\ell' = \bigl(\sum_{i=1}^{n-1} p_i/(\ell-x_i)\bigr)^{-1}$, $\ell_i = 0$ for
$i \le n-1$, and $\ell_n = \ell - x_n = \ell - 1$ satisfy the constraints in (37) as well as
the conditions (40)-(43).

Third case: $p_n = 0$ and $g(1) < \gamma$. First note that (40) and (41), together
with the fact that $\ell_1 = 0$ and $q_1^\star > 0$, imply that $\ell = x_n + \ell_n \ge x_n > x_1$ and
$\ell = x_1 + \ell' p_1/q_1^\star$; thus it must be that $\ell' > 0$. Thus, legitimately applying (44),
we have $\ell > x_{n-1}$ and $q_i^\star = \ell' p_i/(\ell-x_i)$ for $i \le n-1$. Also, as $g(1) < \gamma$,
Lemma 9 shows in this case that there exists a unique $\zeta \in (x_{n-1}, 1)$ such
that $g(\zeta) = \gamma$.

Now, we prove by contradiction that $q_n^\star > 0$. Indeed, if $q_n^\star = 0$, then the
same calculations as in the first case would show that $g(\ell) = \gamma$, thus that
$\ell = \zeta < 1$ as $g$ is decreasing. The contradiction would be that (41) leads
to $\ell \ge x_n = 1$ as noted above. Therefore, we have $q_n^\star > 0$ and thus $\ell_n = 0$
by (43) and $\ell = 1$ by (41). The solution $q^\star$ can be rewritten, for $i \le n-1$,
as $q_i^\star = \ell' p_i/(1-x_i)$, so that, after substitution in (42) and using $p_n = 0$,
we get

\[
\gamma = \sum_{i=1}^{n-1} p_i \log\frac{p_i}{q_i^\star} = \sum_{i=1}^{n-1} p_i \log\frac{1-x_i}{\ell'},
\]

or, equivalently,

\[
\log(\ell') = -\gamma + \sum_{i=1}^{n-1} p_i \log(1-x_i) = -\gamma + g(1) - \log\Biggl(\sum_{i=1}^{n-1} \frac{p_i}{1-x_i}\Biggr).
\]

Hence, $\ell' = \exp(g(1)-\gamma)\big/\bigl(\sum_{j=1}^{n-1} p_j/(1-x_j)\bigr)$, from which we conclude
that $q_n^\star = 1 - \sum_{i=1}^{n-1} q_i^\star = 1 - \exp(g(1)-\gamma)$.

C.1.3. Properties of the function $g$. We prove in this section the following
lemma regarding the function $g$.

Lemma 9. Let $a = 1$ if $p_n > 0$ and $a = x_{n-1}$ if $p_n = 0$. The function
$g$ defined in (38) is a convex (thus continuous) decreasing mapping from
$(a,\infty)$ onto $(0,\infty)$.

Proof. We will make repeated uses in this proof of random variables $Z_\ell$
defined for $\ell > a$ and taking values $1/(\ell-x_i)$ each with probability $p_i$, for
$i \in \{1,\dots,n\}$. Note that because of our assumptions on $(p_1,\dots,p_n)$, which
entail in particular that it has at least two different positive components,
the random variables $Z_\ell$ are not almost-surely constant.

The derivative of $g$ equals

\[
g'(\ell) = \sum_{i=1}^n \frac{p_i}{\ell-x_i} - \frac{\sum_{i=1}^n p_i/(\ell-x_i)^2}{\sum_{i=1}^n p_i/(\ell-x_i)}
= \frac{\bigl(\mathbb{E}[Z_\ell]\bigr)^2 - \mathbb{E}[Z_\ell^2]}{\mathbb{E}[Z_\ell]},
\]

and as $Z_\ell$ is not almost-surely constant, Jensen's inequality shows that
$g'(\ell) < 0$; hence $g$ is decreasing.

We recall that $p_{n-1} > 0$ so that the probability $p_a$ put on $a$ by $(p_1,\dots,p_n)$
is always in $(0,1)$. Using Taylor expansions it can then be checked that
$g(\ell) \sim \log\bigl(p_a(\ell-a)^{p_a-1}\bigr)$ when $\ell \to a$ and, hence, that it tends to $+\infty$ in
$a$. Likewise, $g(\ell) = O(1/\ell)$ when $\ell \to +\infty$ and thus decreases to zero.

We conclude the proof by showing the convexity of $g$; to do so, we show
that its second derivative is nonnegative:

\[
g''(\ell) = \frac{-\mathbb{E}[Z_\ell^2]\bigl(\mathbb{E}[Z_\ell]\bigr)^2 + 2\,\mathbb{E}[Z_\ell^3]\,\mathbb{E}[Z_\ell] - \bigl(\mathbb{E}[Z_\ell^2]\bigr)^2}{\bigl(\mathbb{E}[Z_\ell]\bigr)^2}.
\]

The Cauchy-Schwarz inequality ensures that on the one hand, $(\mathbb{E}[Z_\ell])^2 \le
\mathbb{E}[Z_\ell^2]$ and thus that $\mathbb{E}[Z_\ell^2]\,(\mathbb{E}[Z_\ell])^2 \le (\mathbb{E}[Z_\ell^2])^2$, and that on the other hand,
$(\mathbb{E}[Z_\ell^2])^2 = (\mathbb{E}[Z_\ell^{3/2} Z_\ell^{1/2}])^2 \le \mathbb{E}[Z_\ell^3]\,\mathbb{E}[Z_\ell]$. These two inequalities show that
$g''(\ell) \ge 0$, as claimed. □

C.2. Proof of Proposition 1. We merely sketch the proof of Proposition 1,
based on the proof of Lemma 6. The same arguments as the ones
at the beginning of Section B.2, and in particular, the sandwich equality
described in Section C.1.1, show that

where $\mathcal{K}_{\inf}$ is defined as in (2) with the model $\mathcal{D} = \mathcal{F}$. So, the question is
only whether the result of Lemma 6 holds without the assumption that the
underlying distribution at hand is in $\mathcal{F}$. The answer is seen to be positive.
Indeed, the mentioned proof relies first on a control of $\mathcal{K}_{\inf}(\hat\nu_n, \mathrm{E}(\nu_0))$, which
is based on Lemma 4; the latter is applied therein to $\hat\nu_n$, which has finite
support, even if the underlying distribution is not discrete. As for the second
part of the proof of Lemma 6, it consists only of the application of a union
bound and of a Chernoff bounding: it thus holds true as well.

C.3. Sanov's inequality. We consider a sequence $X_1, X_2, \dots$ of real
random variables, independent and identically distributed according to a
distribution $\nu$ with finite support $S$. For all integers $n \ge 1$, we denote the
empirical distribution corresponding to the first $n$ elements of the sequence
by

\[
\hat\nu_n = \frac{1}{n}\sum_{k=1}^n \delta_{X_k}.
\]

The following lemma, used in Section B.3 for proving Theorem 2, is a
straightforward consequence of Dembo and Zeitouni (1998, Exercise 2.2.38).

Lemma 10. Let $C$ be a closed and convex subset of $\mathfrak{M}_1(S)$. Then, for all $n \ge 1$,

\[
\mathbb{P}\{\hat\nu_n \in C\} \le \exp\Bigl(-n \inf_{\kappa \in C} \mathrm{KL}(\kappa,\nu)\Bigr).
\]

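C.4. Deviation inequality. In this section, we prove the following
maximal inequality that is needed in Section A.1 to prove Theorem 1.

Lemma 11. Consider a sequence $Z_1, Z_2, \dots$ of independent and identically
distributed real random variables with common expectation $\mu_0$ and
denote by $\bar Z_n = (1/n)\sum_{k=1}^n Z_k$ their empirical means. Assume that there
exists an open interval $(\lambda_1,\lambda_2)$ of $\mathbb{R}$ containing 0 and a strictly convex,
continuously differentiable function $\phi : (\lambda_1,\lambda_2) \to \mathbb{R}$ such that

\[
\forall \lambda \in (\lambda_1,\lambda_2), \qquad \log \mathbb{E}\bigl[e^{\lambda Z_1}\bigr] \le \phi(\lambda).
\]

Then for all $\varepsilon > 1$,

\[
\mathbb{P}\Biggl(\bigcup_{k=1}^n \bigl\{\mu_0 > \bar Z_k \ \text{and}\ k\,\phi^\star(\bar Z_k) \ge \varepsilon\bigr\}\Biggr) \le e\,\lceil \varepsilon \log(n) \rceil\,e^{-\varepsilon},
\]

where $\phi^\star : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ is the convex conjugate of $\phi$ defined by

\[
\forall z \in \mathbb{R}, \qquad \phi^\star(z) = \sup_{\lambda \in (\lambda_1,\lambda_2)} \bigl\{\lambda z - \phi(\lambda)\bigr\}.
\tag{45}
\]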
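To make the convex conjugate in (45) concrete, it can be evaluated numerically and compared with a closed form. The sketch below is only illustrative: it takes the Gaussian-type function $\phi(\lambda) = \lambda\mu_0 + \sigma^2\lambda^2/2$, for which $\phi^\star(z) = (z-\mu_0)^2/(2\sigma^2)$, and restricts the supremum to a bounded interval, which is an assumption made here for the search to be well defined. The function names are hypothetical.

```python
import math

def convex_conjugate(phi, z, lam_lo, lam_hi, iters=200):
    """Approximate phi*(z) = sup over lam in (lam_lo, lam_hi) of lam*z - phi(lam).

    Uses a ternary search, valid because lam -> lam*z - phi(lam) is concave.
    """
    lo, hi = lam_lo, lam_hi
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if m1 * z - phi(m1) < m2 * z - phi(m2):
            lo = m1
        else:
            hi = m2
    lam = 0.5 * (lo + hi)
    return lam * z - phi(lam)

def lemma11_bound(n, eps):
    """Right-hand side e * ceil(eps * log n) * exp(-eps) of Lemma 11."""
    return math.e * math.ceil(eps * math.log(n)) * math.exp(-eps)

if __name__ == "__main__":
    mu0, sigma = 0.3, 0.5
    phi = lambda lam: lam * mu0 + 0.5 * (sigma * lam) ** 2
    print(convex_conjugate(phi, 0.1, -10.0, 10.0), (0.1 - mu0) ** 2 / (2 * sigma ** 2))
    print(lemma11_bound(100, 3.0))
```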

As explained below, $\mu_0$, the expectation of the $Z_k$, is the argument of the
global minimum of $\phi^\star$, with $\phi^\star(\mu_0) = 0$; deviations of the empirical averages
$\bar Z_k$ from the mean $\mu_0$ are here considered in terms of deviations of $\phi^\star(\bar Z_k)$
from 0.

Note that the bound in the lemma holds actually for all $\varepsilon > 0$ as soon
as $n \ge 3$, as it is a trivial bound (larger than 1) for $\varepsilon \le 1$ and $n \ge 3$.
Also, symmetric arguments show that under the same assumptions, a similar
deviation bound holds true also for deviations to the right:

\[
\mathbb{P}\Biggl(\bigcup_{k=1}^n \bigl\{\mu_0 < \bar Z_k \ \text{and}\ k\,\phi^\star(\bar Z_k) \ge \varepsilon\bigr\}\Biggr) \le e\,\lceil \varepsilon \log(n) \rceil\,e^{-\varepsilon}.
\]

In this article, however, we only need a control of the deviations to the left.

Some properties of $\phi^\star$. We start by reviewing some useful properties of
$\phi^\star$. First note that $\phi^\star$ is nonnegative (as can be seen by taking $\lambda = 0$ in
its definition) and is strictly convex on $\{\phi^\star < +\infty\}$ (see Rockafellar 1970,
Chapter 26). Denoting by $\mu_0 = \phi'(0)$ the common expectation of the $Z_k$,
we note that by Jensen's inequality, $\phi(\lambda) \ge \lambda\mu_0$ and hence that $\phi^\star(\mu_0) = 0$,
showing that $\mu_0$ is the argument of the global minimum of $\phi^\star$. In particular,
in view of its strict convexity, $\phi^\star$ is non-increasing on $(-\infty,\mu_0)$, and even
continuous and decreasing on the interval $(-\infty,\mu_0) \cap \{\phi^\star < +\infty\}$; symmetric
properties hold on $(\mu_0,+\infty)$.

We now underline the fact that for all $z < \mu_0$,

\[
\phi^\star(z) = \sup\bigl\{\lambda z - \phi(\lambda) : \lambda \in (\lambda_1,0) \ \text{and}\ \lambda z - \phi(\lambda) > 0\bigr\}.
\tag{46}
\]

Indeed, denote by

\[
\psi_z : \lambda \in (\lambda_1,\lambda_2) \longmapsto \lambda z - \phi(\lambda)
\]

the function to maximize. It is strictly concave. If there exists $\lambda_z \in (\lambda_1,\lambda_2)$
such that $\psi_z'(\lambda_z) = z - \phi'(\lambda_z) = 0$, then $\phi^\star(z) = \psi_z(\lambda_z)$; since $\phi'(0) = \mu_0$ and
$\phi'$ is increasing, we get from $\phi'(\lambda_z) = z < \mu_0$ that $\lambda_z < 0$. This proves (46)
in this case. The remaining case is when $\psi_z'(\lambda) \ne 0$ for all $\lambda \in (\lambda_1,\lambda_2)$. By
continuity and since $\psi_z'(0) = z - \mu_0 < 0$, this means that $\psi_z' < 0$ on $(\lambda_1,\lambda_2)$,
that is, $\psi_z$ is decreasing on $(\lambda_1,\lambda_2)$. The defining supremum of $\phi^\star(z)$ thus
corresponds to the limit, as $\lambda \to \lambda_1$, of $\psi_z(\lambda)$. Since $\psi_z(0) = 0$, all the values
$\lambda \in (\lambda_1,0)$ are such that $\psi_z(\lambda) > 0$ while values $\lambda \in [0,\lambda_2)$ are such that
$\psi_z(\lambda) \le 0$; this proves (46) in this case as well.

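Proof of Lemma 11. We start with a peeling argument and to that
end, introduce $n_0 = 0$ and $n_m = \lfloor \gamma^m \rfloor$, for some $\gamma > 1$ that will be chosen at
the end of the proof. We denote by $M = \lceil (\log n)/(\log \gamma) \rceil$ an upper bound
on the number of elements in the peeling. Then, $n_M \ge n$ and

\[
\mathbb{P}\Biggl(\bigcup_{k=1}^n \bigl\{\mu_0 > \bar Z_k \ \text{and}\ k\,\phi^\star(\bar Z_k) \ge \varepsilon\bigr\}\Biggr)
\le \sum_{m=1}^M \mathbb{P}\Biggl(\bigcup_{k=n_{m-1}+1}^{n_m} \Bigl\{\mu_0 > \bar Z_k \ \text{and}\ \phi^\star(\bar Z_k) \ge \frac{\varepsilon}{n_m}\Bigr\}\Biggr).
\]

Let $m \in \{1,\dots,M\}$. Since $\phi^\star$ is decreasing and continuous on the interval
$(-\infty,\mu_0) \cap \{\phi^\star < +\infty\}$, either $\phi^\star < \varepsilon/n_m$ on this interval and the $m$-th
probability in the sum above is null; or there exists a unique $z_m \in (-\infty,\mu_0)$
such that $\phi^\star(z_m) = \varepsilon/n_m$. In this case, using again that $\phi^\star$ is non-increasing on
$(-\infty,\mu_0)$, we have, for all $\lambda < 0$,

\[
\begin{aligned}
\mathbb{P}\Biggl(\bigcup_{k=n_{m-1}+1}^{n_m} \Bigl\{\mu_0 > \bar Z_k \ \text{and}\ \phi^\star(\bar Z_k) \ge \frac{\varepsilon}{n_m}\Bigr\}\Biggr)
&= \mathbb{P}\Biggl(\bigcup_{k=n_{m-1}+1}^{n_m} \bigl\{\bar Z_k \le z_m\bigr\}\Biggr) \\
&= \mathbb{P}\Biggl(\bigcup_{k=n_{m-1}+1}^{n_m} \bigl\{\exp\bigl(\lambda(Z_1+\dots+Z_k) - k\lambda z_m\bigr) \ge 1\bigr\}\Biggr) \\
&= \mathbb{P}\Biggl(\bigcup_{k=n_{m-1}+1}^{n_m} \bigl\{\exp\bigl(\lambda(Z_1+\dots+Z_k) - k\,\phi(\lambda)\bigr) \ge e^{k(\lambda z_m - \phi(\lambda))}\bigr\}\Biggr) \\
&\le \mathbb{P}\Biggl(\bigcup_{k=n_{m-1}+1}^{n_m} \bigl\{\exp\bigl(\lambda(Z_1+\dots+Z_k) - k\,\phi(\lambda)\bigr) \ge e^{(n_{m-1}+1)(\lambda z_m - \phi(\lambda))}\bigr\}\Biggr),
\end{aligned}
\]

where the last inequality was obtained under the additional assumption that
the considered $\lambda < 0$ is such that $\lambda z_m - \phi(\lambda) > 0$.

The fact that the $Z_k$ are independent and identically distributed together
with the definition of $\phi$, imply that for all $\lambda \in (\lambda_1,\lambda_2)$, the process $(W_{\lambda,k})_{k \ge 0}$
defined by $W_{\lambda,0} = 1$ and $W_{\lambda,k} = \exp\bigl(\lambda(Z_1+\dots+Z_k) - k\,\phi(\lambda)\bigr)$ for $k \ge 1$
is a positive supermartingale. As a consequence, Doob's maximal inequality
entails that

\[
\begin{aligned}
&\mathbb{P}\Biggl(\bigcup_{k=n_{m-1}+1}^{n_m} \Bigl\{\exp\Bigl(\lambda\sum_{i=1}^k Z_i - k\,\phi(\lambda)\Bigr) \ge e^{(n_{m-1}+1)(\lambda z_m - \phi(\lambda))}\Bigr\}\Biggr) \\
&\qquad \le \mathbb{P}\Biggl\{\max_{n_{m-1}+1 \le k \le n_m} \exp\Bigl(\lambda\sum_{i=1}^k Z_i - k\,\phi(\lambda)\Bigr) \ge e^{(n_{m-1}+1)(\lambda z_m - \phi(\lambda))}\Biggr\}
\le e^{-(n_{m-1}+1)(\lambda z_m - \phi(\lambda))}.
\end{aligned}
\]

The above bound being valid for all $\lambda < 0$ with $\lambda z_m - \phi(\lambda) > 0$, we finally
get from (46) that

\[
\mathbb{P}\Biggl(\bigcup_{k=n_{m-1}+1}^{n_m} \Bigl\{\mu_0 > \bar Z_k \ \text{and}\ \phi^\star(\bar Z_k) \ge \frac{\varepsilon}{n_m}\Bigr\}\Biggr) \le \exp\bigl(-(n_{m-1}+1)\,\phi^\star(z_m)\bigr).
\]

In view of the definition of $z_m$, the right-hand side equals
$\exp\bigl(-\varepsilon(n_{m-1}+1)/n_m\bigr) \le \exp(-\varepsilon/\gamma)$, where
we used the fact that by definition, $n_{m-1}+1 \ge \gamma^{m-1}$ and $n_m \le \gamma^m$ (for all
$m \ge 1$). Putting everything together, we have proved that, in all cases,

\[
\mathbb{P}\Biggl(\bigcup_{k=1}^n \bigl\{\mu_0 > \bar Z_k \ \text{and}\ k\,\phi^\star(\bar Z_k) \ge \varepsilon\bigr\}\Biggr) \le \Bigl\lceil \frac{\log n}{\log \gamma} \Bigr\rceil\,e^{-\varepsilon/\gamma}.
\]

Choosing $\gamma = \varepsilon/(\varepsilon-1)$, which is legitimate for $\varepsilon > 1$, and applying the
inequality $\log(1+x) \ge x/(1+x)$ to $x = 1/(\varepsilon-1) > -1$, one obtains

\[
\mathbb{P}\Biggl(\bigcup_{k=1}^n \bigl\{\mu_0 > \bar Z_k \ \text{and}\ k\,\phi^\star(\bar Z_k) \ge \varepsilon\bigr\}\Biggr)
\le \Bigl\lceil \frac{\log(n)}{\log\bigl(\varepsilon/(\varepsilon-1)\bigr)} \Bigr\rceil\,e^{-\varepsilon+1}
\le \lceil \varepsilon \log(n) \rceil\,e^{-\varepsilon+1}.
\]

□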
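The peeling construction used in this proof can be written out explicitly. The sketch below, with a hypothetical function name, builds the block boundaries $n_m = \lfloor\gamma^m\rfloor$ for the choice $\gamma = \varepsilon/(\varepsilon-1)$, so that one can check on examples that $n_M \ge n$, that $M \le \lceil\varepsilon\log n\rceil$, and that $(n_{m-1}+1)/n_m \ge 1/\gamma$ in every block.

```python
import math

def peeling(n, eps):
    """Return (gamma, M, boundaries) with gamma = eps/(eps-1), M = ceil(log n / log gamma),
    boundaries[0] = 0 and boundaries[m] = floor(gamma^m) for 1 <= m <= M. Requires eps > 1."""
    gamma = eps / (eps - 1.0)
    M = math.ceil(math.log(n) / math.log(gamma))
    boundaries = [0] + [math.floor(gamma ** m) for m in range(1, M + 1)]
    return gamma, M, boundaries

if __name__ == "__main__":
    gamma, M, boundaries = peeling(1000, 3.0)
    print(gamma, M, boundaries[-1], math.ceil(3.0 * math.log(1000)))
```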